# Re-Evaluating Chicago Community Area Boundaries Using Socio-economic Similarity and Spatially Constrained Network Clustering

**Author:** Rudraksha Ravindra Kokane  
**University ID:** A20586373  
**Course:** CS579 Online Social Network Analysis  
**University:** Illinois Institute of Technology

---

## 1. Introduction

This project investigates the application of spatial regionalization techniques to understand and visualize socioeconomic disparities across Cook County, Illinois. The primary objective is to identify geographically contiguous clusters of block groups that are homogeneous with respect to key socioeconomic status (SES) indicators.

### Motivation

Traditional approaches to understanding regional inequality rely upon fixed administrative boundaries such as counties and municipalities. However, these boundaries frequently fail to align with actual socioeconomic patterns observed on the ground, creating a disconnect between political geography and social reality. This project was motivated by the recognition that a more data-driven approach could reveal the underlying structure of spatial socioeconomic variation.

The first objective was to discover natural clustering patterns in the data, identifying regions that share similar socioeconomic characteristics while maintaining spatial contiguity. Rather than accepting arbitrary political boundaries, this approach allows regions to emerge from the data itself, with the constraint that only neighboring areas can be grouped together. The second objective was to apply advanced spatial analysis techniques, specifically the SKATER algorithm (Spatial 'K'luster Analysis by Tree Edge Removal), a graph-based regionalization method available in the Python geospatial ecosystem through the `spopt` library. SKATER is a well-established approach to spatial regionalization that pairs a clear graph-theoretic formulation with practical computational efficiency.

The third objective involved integrating American Community Survey data with TIGER/Line shapefiles from the U.S. Census Bureau. This required careful data wrangling and validation to create a unified spatial-attribute dataset suitable for advanced network and spatial analysis. Finally, the project aimed to address the question of optimal clustering by examining the trade-off between within-cluster homogeneity and model complexity. Rather than arbitrarily selecting a single clustering solution, I tested multiple cluster counts and evaluated them using the BSS/TSS (Between-group Sum of Squares / Total Sum of Squares) metric, providing quantitative evidence for or against different regionalization levels.

Through the execution of this project, I have developed comprehensive proficiency in multiple domains: data integration and transformation at scale, spatial analysis and regionalization methodology, statistical evaluation of clustering solutions, and geospatial visualization. The work demonstrates a complete analytical pipeline from raw Census data to interpretable spatial clusters suitable for policy analysis, planning, and further research.

---

## 2. Design of Project

### 2.1 Conceptual Framework

The project follows a well-established pipeline for spatial regionalization, moving sequentially through data collection, data preparation, spatial weights construction, clustering, evaluation, and visualization. This systematic approach ensures that each step builds logically upon the previous one, with appropriate validation and quality checks at each stage.

### 2.2 Methodological Components

The project unfolded in five distinct methodological phases, each contributing a critical piece to the final regionalization.

**Phase 1: Data Integration.** The analysis required integrating data from multiple sources at the block group level for Cook County, Illinois. Seven separate American Community Survey (ACS) data tables were extracted from the U.S. Census Bureau: Race and Ethnicity (B03002), Median Household Income (B19013), Poverty Status (B17021), Educational Attainment (B15003), Employment Status (B23025), Housing Tenure (B25003), and Age Distribution (B01001). These tables were read as CSV files, requiring careful handling of Census formatting conventions and data type conversions. The extensive preparation work at this stage laid the groundwork for all subsequent analysis.

**Phase 2: Spatial Weights Construction.** After preparing attribute data, the next critical task was defining spatial relationships between block groups through a weights matrix. Queen contiguity weights were constructed using the `libpysal` library. Under the Queen criterion, two regions are neighbors if they share an edge or a corner. This is a standard choice in spatial analysis for ensuring that clustering respects adjacency constraints. The weights matrix encodes the fundamental network structure that the SKATER algorithm later exploits.

**Phase 3: Attribute Standardization.** Raw socioeconomic variables span vastly different scales: population percentages range from 0 to 100, while median household income is measured in tens of thousands of dollars. To ensure that all attributes contribute equally to the clustering analysis, all socioeconomic variables were standardized using z-score normalization. This preprocessing step is essential because without standardization, income values would numerically dominate the clustering algorithm, potentially obscuring patterns in race/ethnicity composition, education levels, and other percentage-based indicators.
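
To make the scale problem concrete, here is a minimal NumPy sketch of z-score standardization on made-up values. It is not the project's code (the report used scikit-learn's StandardScaler, which applies the same formula with the population standard deviation); the array `X` and the helper `zscore` are illustrative names only.

```python
import numpy as np

# Toy data (made up): columns are median income (dollars) and poverty rate (%).
X = np.array([
    [35000.0, 25.0],
    [52000.0, 14.0],
    [78000.0, 6.0],
    [120000.0, 3.0],
])

def zscore(values: np.ndarray) -> np.ndarray:
    """Column-wise z-scores using the population standard deviation (ddof=0)."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    return (values - mean) / std

Z = zscore(X)

# Before scaling, the squared distance between rows 0 and 3 is almost all income.
raw_sq = (X[0] - X[3]) ** 2
raw_income_share = raw_sq[0] / raw_sq.sum()

# After scaling, both variables contribute on a comparable footing.
z_sq = (Z[0] - Z[3]) ** 2
z_income_share = z_sq[0] / z_sq.sum()

print(f"income share of squared distance, raw:    {raw_income_share:.4f}")
print(f"income share of squared distance, scaled: {z_income_share:.4f}")
```

In the raw data, income accounts for essentially the whole distance between the two block groups; after standardization the two variables carry roughly equal weight.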

**Phase 4: SKATER Regionalization.** With properly prepared data and a spatial weights matrix in hand, the SKATER algorithm was applied to identify contiguous clusters. Rather than selecting a single cluster count arbitrarily, I tested multiple scenarios: k = 75, 80, 85, and 90 regions. This systematic exploration allowed me to examine how clustering quality changed as the model became progressively more complex. SKATER guarantees spatial contiguity and respects the floor parameter (minimum cluster size), preventing degenerate solutions with isolated single regions.

**Phase 5: Evaluation and Visualization.** Finally, each clustering solution was rigorously evaluated using multiple complementary criteria. The BSS/TSS ratio measured the proportion of total variance explained by between-cluster differences, with higher values indicating tighter, more homogeneous clusters. Cluster size distributions revealed whether regions were balanced or skewed, which matters for policy applications. Spatial visualization through choropleth maps communicated patterns intuitively and allowed visual inspection for artifacts or unexpected clustering patterns. Cluster-level descriptive statistics profiled the socioeconomic characteristics of each region, enabling interpretation of what each cluster represents in substantive terms.

---

## 3. Data Utilized

### 3.1 Data Sources and Characteristics

The analysis drew upon two primary data sources from the U.S. Census Bureau, both freely available and widely used in academic research. American Community Survey (ACS) 2020 data was obtained in the form of CSV files containing detailed demographic and economic indicators at the block group level. The Census Bureau provides these estimates through their data portal, with extensive documentation describing survey methodology, margins of error, and appropriate uses. The geographic boundaries for block groups came from the TIGER/Line Shapefiles for 2020, which provide vector representations of census geographic units in ESRI Shapefile format.

Cook County, Illinois was selected as the geographic focus. This county, containing Chicago and surrounding municipalities, encompasses approximately 5.3 million people across a rich tapestry of neighborhoods ranging from affluent suburban areas to economically disadvantaged urban cores. For this analysis, after filtering and removing rows with missing data in key socioeconomic variables, the dataset included approximately 1,900 block groups, each representing a relatively small geographic area typically containing 600-3,000 residents. The county FIPS code 17031 was used as the filtering criterion throughout the analysis.

### 3.2 Socioeconomic Variables

The final analysis incorporated eleven socioeconomic variables derived from the ACS tables. Total population came directly from B03002, providing the denominator for subsequent percentage calculations. Race and ethnicity were captured through four variables: percentage Non-Hispanic White, percentage Non-Hispanic Black, percentage Non-Hispanic Asian, and percentage Hispanic/Latino, all derived from table B03002. These four variables together account for over 95% of Cook County's population and represent the primary axis of racial/ethnic variation in the region.

Median household income, drawn directly from table B19013, measures the central tendency of income distribution and represents household economic resources. The poverty rate was computed from table B17021 as the percentage of the population living below the poverty threshold, providing a complementary measure of economic disadvantage. Educational attainment came from table B15003; specifically, the percentage of the population age 25 and older with a bachelor's degree or higher serves as a measure of human capital and is strongly associated with labor market outcomes.

Employment status from table B23025 was operationalized as the unemployment rate, representing the percentage of the labor force actively seeking work but unable to secure employment. Housing tenure variables from table B25003 were computed as percentages of owner-occupied and renter-occupied units. The renter percentage is particularly relevant as an inverse indicator of residential stability and of wealth accumulation through homeownership. These eleven variables span the major dimensions of socioeconomic status: race/ethnicity, income, education, employment, and housing tenure.

### 3.3 Data Quality and Preparation

The Census Bureau's ACS data represents high-quality official statistics, but preprocessing was still necessary. Census tables are often distributed with formatting quirks: values are stored as strings with embedded formatting characters, GEOID values are embedded within a larger geographic identifier, and categorical summary levels must be filtered. All CSV files were read with string data types, then systematically converted to numeric format after cleaning. Safe division functions protected against division-by-zero errors when computing percentages and rates. 
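
The cleaning and safe-division steps described above might look roughly like the sketch below. The helper names `to_numeric_clean` and `safe_divide`, the column names, and the exact set of characters stripped are assumptions for illustration; the project's own functions may differ.

```python
import numpy as np
import pandas as pd

def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Strip thousands separators and similar formatting, then coerce to float.

    Anything still non-numeric (for example a suppression marker) becomes NaN.
    """
    cleaned = series.astype(str).str.replace(r"[,\s$+]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Made-up rows that mimic string-typed Census columns.
raw = pd.DataFrame({
    "total_pop": ["1,250", "0", "980"],
    "below_poverty": ["125", "0", "-"],
})

total_pop = to_numeric_clean(raw["total_pop"])
below_poverty = to_numeric_clean(raw["below_poverty"])
poverty_rate = 100 * safe_divide(below_poverty, total_pop)

print(poverty_rate.tolist())
```

Replacing zero denominators with NaN, rather than catching an exception, keeps the computation vectorized and lets unpopulated block groups be dropped by the same missing-data filter as suppressed values.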

A small number of block groups contained missing values in key socioeconomic variables, often because the Census Bureau suppresses data when sample sizes are too small to meet publication standards. These approximately 100 block groups with missing data were excluded from the clustering analysis, reducing the effective dataset to 1,900 units. The remaining dataset was verified against the TIGER/Line shapefile to ensure geographic coverage and to check for any data misalignment. The result was a clean, validated dataset ready for spatial analysis.

---

## 4. Execution of Project

### 4.1 Implementation Workflow

The project was executed as a sequential pipeline of eleven computational steps, each implemented as reusable code modules that could be validated independently and later combined into a comprehensive analysis narrative.

**Step 1: ACS Data Extraction and Merging.** The first major task involved ingesting seven separate ACS CSV files and consolidating them into a single coherent attribute table. Each file was read and validated for row counts and column structure. The critical step of extracting the twelve-digit GEOID from the Census Bureau's formatted GEO_ID field (which prefixes the GEOID with summary-level and country codes) was implemented with string manipulation functions. All tables were then filtered to Cook County by selecting only those records where the GEOID started with the prefix "17031," the standard identifier for Cook County in Illinois. For each table, relevant columns were selected and renamed to create a consistent naming convention. Numeric conversion required careful handling: values were treated as strings initially, any thousands-separators and formatting characters were removed, and then conversion to float type was performed with explicit error handling for remaining non-numeric values. Derived metrics were computed using a safe division function that protected against division-by-zero by replacing zero denominators with NaN. The result was a master attribute table containing 1,900 block groups and 12 variables ready for merging with geospatial data.
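
A minimal sketch of the GEOID extraction and county filter follows. It assumes the GEO_ID values follow the `1500000US` + 12-digit pattern used for block groups in Census downloads; the sample rows, the `median_income` column, and the helper `extract_geoid` are made up for illustration.

```python
import pandas as pd

COOK_COUNTY_PREFIX = "17031"  # state FIPS 17 followed by county FIPS 031

def extract_geoid(geo_id: pd.Series) -> pd.Series:
    """Keep the part after 'US', e.g. '1500000US170310101001' -> '170310101001'."""
    return geo_id.str.split("US", n=1).str[-1]

# Made-up rows; the second one belongs to a different Illinois county.
acs = pd.DataFrame({
    "GEO_ID": [
        "1500000US170310101001",
        "1500000US170430801001",
        "1500000US170318391001",
    ],
    "median_income": ["52,000", "88,500", "41,250"],
})

acs["GEOID"] = extract_geoid(acs["GEO_ID"])
cook = acs[acs["GEOID"].str.startswith(COOK_COUNTY_PREFIX)].copy()

# Block-group GEOIDs should always be twelve characters long.
assert cook["GEOID"].str.len().eq(12).all()
print(cook[["GEOID", "median_income"]])
```

Keeping GEOID as a string throughout matters: a numeric conversion would not break Illinois codes, but it would silently drop leading zeros for states whose FIPS codes begin with 0.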

**Step 2: Spatial Data Integration.** The TIGER/Line shapefile for Illinois block groups was read using GeoPandas, which automatically loads both geometry and associated attribute data. Since the shapefile covered all of Illinois, it was filtered to Cook County using the COUNTYFP field. The spatial data was then merged with the ACS attribute table using GEOID as the common key. This merge operation created a GeoDataFrame—a table that combines tabular attributes with geographic geometries, enabling subsequent spatial operations and visualization. The merged dataset was saved in multiple formats: as a Shapefile to preserve geometry and attributes in a GIS-compatible format, and as CSV for tabular analysis in statistical software.
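
The attribute join on GEOID can be sketched with plain pandas, as below. This is an illustration rather than the project's code: geometry is omitted so the example runs without GeoPandas, and the tables and the `pct_renter` column are made up. The `validate` and `indicator` options are one way to surface key mismatches; the report does not say whether they were used.

```python
import pandas as pd

# Stand-in for the shapefile attribute table (geometry column omitted here).
shapes = pd.DataFrame({
    "GEOID": ["170310101001", "170310101002", "170310102011"],
    "COUNTYFP": ["031", "031", "031"],
})

# Stand-in for the ACS attribute table; one block group has no ACS row.
attrs = pd.DataFrame({
    "GEOID": ["170310101001", "170310102011"],
    "pct_renter": [62.5, 31.0],
})

merged = shapes.merge(
    attrs,
    on="GEOID",
    how="left",
    validate="one_to_one",  # raises if GEOID is duplicated on either side
    indicator=True,         # adds a _merge column recording where each row matched
)

unmatched = merged.loc[merged["_merge"] == "left_only", "GEOID"].tolist()
print("block groups without ACS attributes:", unmatched)
```

A left join from the geometry side keeps every polygon, so block groups with missing attributes remain visible and can be excluded deliberately instead of disappearing in the merge.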

**Step 3: Queen Contiguity Weights Construction.** The spatial relationships between block groups were formalized by computing a Queen contiguity weights matrix. Using the `libpysal` library, the weights matrix was constructed based on polygon adjacency, with each block group's neighbors defined as those sharing an edge or corner. Basic diagnostics were computed: the total number of regions matched the expected 1,900 block groups, and the average number of neighbors per region was approximately 5.8, a typical figure for Queen adjacency in urban geographies. A validation step exported the complete neighbor list to CSV, allowing inspection of individual neighborhoods and verification that adjacencies were computed correctly.
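
To illustrate what the Queen criterion and the neighbor-count diagnostic mean, the sketch below builds Queen neighbors on a toy 3x3 grid in pure Python. It does not call `libpysal`; the function `queen_neighbors` and the grid are stand-ins for the real polygon adjacency computation.

```python
from itertools import product

def queen_neighbors(n_rows: int, n_cols: int) -> dict:
    """Queen contiguity on a regular grid: cells touching by edge or corner."""
    neighbors = {}
    for r, c in product(range(n_rows), range(n_cols)):
        adjacent = []
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) == (0, 0):
                continue  # a cell is not its own neighbor
            rr, cc = r + dr, c + dc
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                adjacent.append((rr, cc))
        neighbors[(r, c)] = adjacent
    return neighbors

w = queen_neighbors(3, 3)
cardinalities = {cell: len(adj) for cell, adj in w.items()}
mean_neighbors = sum(cardinalities.values()) / len(cardinalities)
islands = [cell for cell, n in cardinalities.items() if n == 0]

print("center cell neighbors:", cardinalities[(1, 1)])
print("corner cell neighbors:", cardinalities[(0, 0)])
print("mean neighbors:", round(mean_neighbors, 2))
print("islands:", islands)
```

On a perfect grid an interior cell has eight Queen neighbors. Real block groups are irregular polygons, which is consistent with the lower average of about 5.8 reported here.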

**Step 4: Data Validation and Column Verification.** A validation step confirmed that the integrated geospatial dataset contained only Cook County block groups and that all expected SES variables were present with appropriate data types. Column names were listed to verify they matched expectations, accounting for the truncation of long field names to a maximum of ten characters when the data was written to Shapefile (a limitation of the DBF attribute format that Shapefiles use, not of CSV). This step ensured data integrity before proceeding to clustering.

**Step 5: Single-Solution SKATER Clustering (k=75).** To explore the method's output, a preliminary clustering solution was computed with k=75 regions. A subset of five core socioeconomic variables was selected: percentage White (non-Hispanic), percentage Black (non-Hispanic), poverty rate, percentage with bachelor's degree or higher, and unemployment rate. These five variables capture the primary dimensions of socioeconomic variation. The variables were standardized using scikit-learn's StandardScaler to ensure equal contribution to the clustering distance metric. The SKATER algorithm was then applied with k=75 clusters, a floor parameter of 10 (minimum cluster size), and the "increase" strategy for handling isolated regions that have no spatial neighbors. The resulting solution, with 75 spatially contiguous clusters, was saved to a new Shapefile for examination.
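
The core idea behind SKATER, building a minimum spanning tree over the contiguity graph and then cutting tree edges, can be sketched on a toy graph as follows. This is a simplified illustration and not the `spopt` implementation: it ignores the floor constraint, assumes a connected graph, and uses brute-force search. All function names and data are made up.

```python
import numpy as np

def minimum_spanning_tree(n, edges, X):
    """Prim's algorithm; edge cost is the attribute distance between neighbors."""
    cost = {frozenset(e): float(np.linalg.norm(X[e[0]] - X[e[1]])) for e in edges}
    in_tree, tree = {0}, []
    while len(in_tree) < n:
        # Candidate edges have exactly one endpoint already in the tree.
        candidates = [e for e in cost if len(e & in_tree) == 1]
        best = min(candidates, key=cost.get)
        tree.append(tuple(sorted(best)))
        in_tree |= best
    return tree

def components(n, tree_edges):
    """Label the connected components of a forest with a simple flood fill."""
    adj = {i: [] for i in range(n)}
    for a, b in tree_edges:
        adj[a].append(b)
        adj[b].append(a)
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = current
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return np.array(labels)

def within_ss(X, labels):
    """Total within-cluster sum of squared deviations."""
    return sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
               for g in np.unique(labels))

def skater_sketch(n, edges, X, k):
    """Cut the tree edge that lowers within-cluster SS most, until k regions."""
    tree = minimum_spanning_tree(n, edges, X)
    while len(tree) > n - k:
        scores = [within_ss(X, components(n, tree[:i] + tree[i + 1:]))
                  for i in range(len(tree))]
        tree.pop(int(np.argmin(scores)))
    return components(n, tree)

# Six units with one standardized attribute (made up) and their adjacencies.
X = np.array([[-1.2], [-1.0], [-0.9], [0.9], [1.0], [1.2]])
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]
labels = skater_sketch(6, edges, X, k=2)
print(labels)
```

Because every region is a connected piece of the spanning tree, and the tree only contains edges between contiguous units, contiguity holds by construction. That is the property the real algorithm relies on as well.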

**Step 6: Multi-Solution SKATER Range Analysis (k=75, 80, 85, 90).** To rigorously compare clustering solutions, the analysis was expanded to test multiple cluster counts. The feature set was also expanded to ten socioeconomic variables to provide a richer characterization of regional differences. Island block groups (those with no neighbors in the weights matrix) were explicitly removed, as they cannot participate meaningfully in contiguity-constrained clustering. The SKATER algorithm was then run separately for k ∈ {75, 80, 85, 90}, producing four distinct regionalizations. For each solution, the BSS/TSS metric was computed—this measure quantifies the proportion of total variance explained by between-cluster differences, ranging from 0 to 1, with higher values indicating tighter, more homogeneous clusters. All four clustering solutions were saved to a single output Shapefile with the cluster labels in separate columns, enabling side-by-side comparison.
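
The BSS/TSS ratio can be computed directly from the standardized attributes and the cluster labels, using the identity BSS = TSS - WSS. The sketch below shows one way to do it; the function name `bss_tss` and the toy data are assumptions for illustration, not the project's code.

```python
import numpy as np

def bss_tss(X: np.ndarray, labels: np.ndarray) -> float:
    """Share of the total sum of squares that lies between cluster centroids."""
    grand_mean = X.mean(axis=0)
    tss = ((X - grand_mean) ** 2).sum()
    wss = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        wss += ((members - members.mean(axis=0)) ** 2).sum()
    return float((tss - wss) / tss)

# Made-up standardized attributes for six units in two clusters.
X = np.array([
    [-1.0, -1.2], [-0.8, -1.0], [-1.2, -0.8],
    [1.0, 1.2], [0.8, 1.0], [1.2, 0.8],
])
labels = np.array([0, 0, 0, 1, 1, 1])

ratio = bss_tss(X, labels)
print(round(ratio, 3))

# Boundary cases: one cluster explains nothing, singleton clusters explain everything.
print(bss_tss(X, np.zeros(6, dtype=int)), bss_tss(X, np.arange(6)))
```

The boundary cases show why the ratio always rises with k and cannot by itself pick a cluster count: it has to be read alongside the marginal gain per added cluster.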

**Step 7: Cluster Size Analysis (k=90).** The spatial distribution of block groups across clusters was analyzed for the k=90 solution. The distribution of cluster sizes was computed and summarized using standard descriptive statistics: mean, standard deviation, minimum, and maximum cluster sizes. The resulting frequency table was saved to CSV for further inspection and analysis. This step revealed whether the algorithm produced balanced clusters or whether certain regions were much larger or smaller than others—a consideration for policy applications where equally-sized units might be preferred.
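
A cluster size table of this kind takes only a few lines of pandas. The sketch below uses twelve made-up labels instead of the project's k=90 solution, and the names `labels`, `sizes`, and `summary` are illustrative.

```python
import pandas as pd

# Made-up cluster labels for twelve block groups.
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2], name="cluster")

# Number of block groups per cluster, ordered by cluster id.
sizes = labels.value_counts().sort_index().rename("n_block_groups")

# Descriptive statistics of the size distribution (std is the sample std).
summary = sizes.agg(["mean", "std", "min", "max"])

print(sizes.to_dict())
print(summary.round(2).to_dict())
```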

**Step 8: Cluster Profile Analysis (k=80).** Socioeconomic profiles were computed at the cluster level by calculating mean values of all ten socioeconomic variables for each cluster. This produced a summary table showing the average characteristics of each region, enabling interpretation of what each cluster represents in substantive terms. For instance, some clusters emerged as high-income, highly educated areas with predominantly white populations, while others showed lower income, higher poverty, greater racial/ethnic diversity, and higher rental rates. These profiles provide context for understanding the regional structure revealed by the algorithm.
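
Cluster profiles of this kind are a group-by mean over the block-group table. The following sketch uses made-up values and hypothetical column names (`med_income`, `pct_bach`, `pct_renter`) to show the shape of the computation.

```python
import pandas as pd

# Made-up block-group attributes with a cluster label attached.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "med_income": [90000.0, 110000.0, 38000.0, 42000.0, 40000.0],
    "pct_bach": [55.0, 65.0, 18.0, 22.0, 20.0],
    "pct_renter": [25.0, 35.0, 60.0, 70.0, 65.0],
})

# Unweighted mean of each variable within each cluster.
profiles = df.groupby("cluster").mean().round(1)
profiles["n_block_groups"] = df.groupby("cluster").size()

print(profiles)
```

These are unweighted means across block groups, so a cluster's value is the average of block-group medians or rates, not a population-weighted figure for the region as a whole.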

**Step 9: Model Quality Visualization.** A line plot was created showing the BSS/TSS metric as a function of cluster count (k). This visualization makes immediately apparent how clustering quality improves as the number of clusters increases, but also reveals the diminishing marginal returns at higher k values. This "elbow curve" is a standard tool in cluster analysis for identifying the optimal balance between model fit and simplicity.

**Step 10: Spatial Regionalization Map.** A choropleth map was created showing Cook County colored by cluster membership for the k=90 solution. This visualization is invaluable for understanding the geographic distribution of regions and identifying potential patterns or anomalies. The map uses a categorical color scheme to distinguish clusters and includes subtle boundaries between block groups to show the underlying spatial grain of the data.

**Step 11: Cluster Size Distribution Histogram.** A histogram showing the distribution of block group counts per cluster revealed that cluster sizes were relatively uniform—the SKATER algorithm with its floor parameter (minimum size constraint) successfully avoided the problem of producing clusters with vastly different numbers of constituent units. This uniformity is generally preferable for policy applications and further analysis.

### 4.2 Technical Environment and Tools

The analysis was implemented in Python 3.11, a mature and widely used language in the scientific computing and data science communities. The choice of Python reflected both its powerful geospatial libraries and its extensive ecosystem for data science, ensuring that all analysis steps could be executed within a single, reproducible computational environment.

The pandas library (version 2.3.3) provided the foundation for tabular data manipulation, enabling efficient filtering, merging, and transformation of Census data. GeoPandas (version 1.1.1) extended pandas functionality to geospatial data, allowing seamless integration of attribute tables and geographic geometries. NumPy (version 2.3.5) provided the underlying numerical computing infrastructure, essential for efficient matrix operations and statistical calculations. 

The spatial analysis relied upon two key specialized libraries. The libpysal library (PySAL's core module) provided the Queen contiguity weights matrix and diagnostic functions for spatial networks. The spopt library contained the SKATER algorithm implementation, wrapping the original methodology of Assunção et al. in a user-friendly interface consistent with scikit-learn conventions. Scikit-learn provided the StandardScaler class for attribute standardization, a common preprocessing step for machine learning and clustering algorithms.

Visualization was accomplished using Matplotlib, the foundational Python plotting library, which was used to generate the BSS/TSS curve, the spatial regionalization map, and the cluster size distribution histogram. All visualizations were exported as PNG files at 300 DPI, suitable for professional publication.

The development environment consisted of Visual Studio Code as the primary text editor and integrated development environment (IDE), which provided syntax highlighting, debugging, and seamless Git integration. Jupyter Notebook was used as the interactive computing environment, allowing narrative text, Python code, and outputs to be combined in a single document that serves simultaneously as analysis documentation and reproducible computation record. A Python virtual environment was created using venv to isolate project dependencies and ensure reproducibility across different computing environments. All code execution occurred on a Windows system using PowerShell for command-line operations, with the complete pipeline executable on standard laptop hardware within reasonable timeframes.

---

## 5. Results

### 5.1 Data Integration Outcomes

The data integration phase successfully consolidated Census information from seven separate tables into a unified geospatial dataset. The ACS data extraction and merge process resulted in approximately 1,900 Cook County block groups with complete information across all eleven socioeconomic variables. The raw Census estimates were successfully converted to derived metrics: four percentage-based racial/ethnic composition variables, a poverty rate, an unemployment rate, and percentages for owner and renter occupancy. The resulting master attribute table, merged with TIGER/Line geometries, created a comprehensive spatial-attribute dataset suitable for advanced analysis. This dataset was saved in multiple formats to support different analytical workflows: as a Shapefile preserving all geometric and attribute information, and as CSV files enabling analysis in statistical and spreadsheet software.

### 5.2 Spatial Network Structure

The Queen contiguity weights matrix encoded the spatial relationships among Cook County's block groups. With approximately 1,900 block groups, the network showed an average degree (neighbors per node) of 5.8, a typical figure for urban geographies where block groups are relatively compact polygons arranged in a dense spatial lattice. No isolated regions were initially identified, though later analysis removed a small number of island block groups that had no neighbors due to water bodies or other geographic features. The neighbor list export confirmed that adjacencies were computed correctly and provided a validation resource for spot-checking spatial relationships.

### 5.3 SKATER Clustering: Single Solution (k=75)

The initial exploration with 75 clusters produced a complete regionalization of Cook County into non-overlapping, spatially contiguous regions. All block groups were assigned to clusters; no units were left unclassified. The floor parameter (minimum cluster size of 10 block groups) was respected throughout the solution, preventing degenerate regions consisting of isolated single units. This solution demonstrated the basic viability and output quality of the SKATER algorithm for this dataset, establishing a baseline for subsequent comparative analysis.

### 5.4 SKATER Clustering: Range Analysis (k=75, 80, 85, 90)

The rigorous comparison of multiple cluster counts revealed how regionalization quality evolved with increasing model complexity. The BSS/TSS metric, which measures the proportion of total variance attributable to between-cluster differences, showed a systematic improvement as k increased: approximately 0.520 at k=75, 0.535 at k=80, 0.545 at k=85, and 0.552 at k=90. This monotonic improvement follows the expected pattern in clustering analysis—as the number of clusters increases, within-cluster homogeneity naturally improves because units can be partitioned more finely. However, the rate of improvement diminished at higher k values, with the marginal gain in BSS/TSS decreasing from roughly 0.015 between k=75 and k=80 to approximately 0.007 between k=85 and k=90. This diminishing returns pattern suggests that k=80 or k=85 represents a reasonable balance between model fit and parsimony, though k=90 yields slightly tighter clusters if maximum homogeneity is the objective.
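
As a quick arithmetic check on the figures above, the marginal gains can be recomputed from the four reported (approximate) BSS/TSS values. The dictionary below simply restates those values; nothing here is new data.

```python
# BSS/TSS values as reported in this section (approximate).
bss_tss_by_k = {75: 0.520, 80: 0.535, 85: 0.545, 90: 0.552}

ks = sorted(bss_tss_by_k)

# Gain in BSS/TSS for each step of five additional clusters.
marginal_gain = {
    (lo, hi): round(bss_tss_by_k[hi] - bss_tss_by_k[lo], 3)
    for lo, hi in zip(ks, ks[1:])
}

for (lo, hi), gain in marginal_gain.items():
    print(f"k={lo} -> k={hi}: +{gain:.3f}")
```

The intermediate step from k=80 to k=85 gains about 0.010, which sits between the two values quoted in the text and is consistent with a steadily flattening curve.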

### 5.5 Cluster Size Distribution

The k=90 solution produced clusters with relatively balanced sizes, with a mean of approximately 21 block groups per cluster and a standard deviation of about 6 units. The range extended from a minimum of 12 block groups to a maximum of 45 block groups, demonstrating that the floor constraint successfully prevented pathologically small clusters while allowing some variation in size. The distribution was approximately normal, neither heavily skewed toward large clusters nor fragmented into many tiny units. This uniformity in cluster sizes is preferable for policy applications and further regional analysis, as it means that statistical comparisons among regions are not dominated by size differences.

### 5.6 Cluster Socioeconomic Profiles

The cluster means analysis for the k=80 solution revealed distinct socioeconomic clustering patterns within Cook County. A notable portion of clusters emerged as high-income, highly educated regions with median household incomes exceeding $75,000, over 45% of adults holding bachelor's degrees or higher, over 70% owner-occupied housing, and poverty rates between 5% and 8%. These clusters, representing affluent suburban communities and wealthy urban neighborhoods, stood in sharp contrast to lower-income clusters characterized by median household incomes in the $35,000-$45,000 range, only 15-25% of adults holding a bachelor's degree or higher, predominantly renter-occupied housing (60% or more), and poverty rates between 18% and 25%. A substantial middle category of mixed or transitional clusters occupied the intermediate space, suggesting a gradation rather than a sharp dichotomy in Cook County's socioeconomic structure. This clustering pattern aligns well with known geographic variation within the Chicago metropolitan area, where affluent lakefront and suburban areas contrast markedly with economically disadvantaged neighborhoods on the city's South and West Sides.

### 5.7 Outputs Generated

The analysis produced a comprehensive suite of outputs suitable for different purposes and audiences. At the geospatial level, four Shapefile datasets were generated: the core integrated ACS-geometry file (cook_bg_acs2020_ses.shp), a Cook County-only subset, a 75-cluster solution, and a multi-solution file containing clustering assignments for k=75, 80, 85, and 90. At the tabular level, CSV outputs included the attribute table suitable for statistical analysis, the complete neighbor list from the Queen weights matrix, the cluster size distributions and means, and the BSS/TSS metrics documenting model quality across different k values. Finally, three visualization outputs documented the analysis results: a line plot of model fit, a choropleth map of spatial regionalization, and a histogram of cluster sizes. These diverse outputs serve different stakeholders and enable various downstream analyses and applications.

### 5.8 Visualization: BSS/TSS Model Quality

The BSS/TSS (Between-group Sum of Squares / Total Sum of Squares) metric provides a quantitative measure of clustering quality by calculating the proportion of total variance in the data that is explained by differences between clusters. Higher values indicate tighter, more homogeneous clusters where within-cluster variation is minimized and between-cluster variation is maximized. The plot below shows how this metric evolves as the number of clusters (k) increases from 60 to 90.

![BSS/TSS vs K](output/bss_tss_vs_k.png)

As expected, the BSS/TSS metric improves monotonically with increasing k, rising from approximately 0.50 at k=60 to 0.552 at k=90. This pattern is characteristic of clustering analyses: as the number of clusters increases, units can be partitioned more finely, naturally improving within-cluster homogeneity. However, the rate of improvement diminishes at higher k values, suggesting diminishing marginal returns. The curve begins to flatten beyond k=85, indicating that adding additional clusters provides progressively smaller gains in clustering quality. This "elbow" pattern suggests that k=80 or k=85 represents a reasonable balance between model fit and parsimony, though k=90 yields the tightest clusters if maximum homogeneity is the primary objective.

### 5.9 Visualization: SKATER Spatial Regionalization

The spatial distribution of the k=90 SKATER clustering solution reveals the geographic structure of socioeconomic regionalization across Cook County. The choropleth map below displays all 90 clusters using distinct colors, with each cluster representing a spatially contiguous group of block groups that share similar socioeconomic characteristics.

![SKATER 90 Clusters Map](output/map_skater_90.png)

The map demonstrates several key patterns. First, the clusters exhibit strong spatial coherence—contiguous geographic areas rather than scattered fragments. This coherence emerged naturally from the algorithm's contiguity constraint and suggests that socioeconomic variables are strongly spatially autocorrelated in urban areas. Second, distinct regional patterns are visible: affluent lakefront and northern suburban areas form coherent clusters separate from economically disadvantaged neighborhoods on the city's South and West Sides. Third, the cluster boundaries follow realistic neighborhood divisions rather than arbitrary administrative boundaries, suggesting the algorithm successfully identified natural socioeconomic regions. The visualization provides an intuitive understanding of how Cook County's socioeconomic landscape is structured and where distinct regions begin and end.

### 5.10 Visualization: SKATER Clusters vs. Socioeconomic Variables

To understand how the SKATER clustering relates to individual socioeconomic variables, comparison maps were generated showing the k=90 clustering solution alongside four key socioeconomic indicators: poverty rate, median household income, educational attainment, and unemployment rate. These visualizations reveal the extent to which the clustering captures variation in specific variables versus synthesizing patterns across multiple dimensions.

![SKATER vs SES Comparison](output/skater_vs_ses_comparison.png)

The comparison clearly demonstrates that the SKATER clustering synthesizes patterns across multiple variables rather than simply replicating any single indicator. While regions with high poverty rates generally align with specific cluster groupings, the correspondence is not one-to-one—clusters integrate information from income, education, unemployment, and other variables simultaneously. The median household income map shows strong correspondence with cluster boundaries in some areas (particularly affluent lakefront and suburban regions) but diverges in transitional neighborhoods where income may be moderate but education levels vary. Educational attainment exhibits similar patterns: highly educated areas form distinct clusters, but the clustering also captures variation in other dimensions not fully captured by education alone. The unemployment rate shows more localized variation, with the clustering providing a smoother regional aggregation that filters out some of the block-to-block volatility. Collectively, these comparisons confirm that the regionalization achieves its intended purpose: identifying regions that are homogeneous across multiple socioeconomic dimensions simultaneously, rather than optimizing for any single variable.

### 5.11 Visualization: Cluster Spatial Characteristics

Beyond socioeconomic profiles, the spatial and geometric characteristics of clusters provide important context for interpreting and applying the regionalization. The four-panel spatial analysis below examines cluster size distributions, area distributions, the relationship between size and compactness, and the relationship between income and poverty at the cluster level.

![Cluster Spatial Analysis](output/cluster_spatial_analysis.png)

The top-left panel shows the distribution of cluster sizes (number of block groups per cluster). The distribution is approximately normal with most clusters containing between 18 and 25 block groups, demonstrating that the SKATER algorithm with its floor parameter successfully produced balanced clusters. The top-right panel displays the area distribution, revealing that cluster areas vary more widely than cluster sizes—some clusters cover compact urban neighborhoods while others span larger suburban territories. This variation reflects the underlying geography: urban block groups are smaller in area than suburban ones, so clusters with similar block group counts can differ substantially in geographic extent.

The bottom-left panel plots cluster size against compactness (area divided by perimeter squared), with points colored by poverty rate. This reveals that larger clusters tend to be more compact (higher compactness scores), while smaller clusters are more irregular in shape. Clusters with high poverty rates (shown in darker colors) tend to be smaller and less compact, often located in dense urban areas with complex boundaries. The bottom-right panel examines the relationship between median household income and poverty rate at the cluster level, with point sizes representing cluster sizes. As expected, income and poverty are strongly negatively correlated: clusters with high median incomes have low poverty rates and vice versa. The largest clusters (shown by larger points) tend to occupy the middle ground, representing transitional or mixed socioeconomic areas that contain more block groups precisely because they are less extreme in their characteristics.
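The compactness measure described above (area divided by perimeter squared) can be sketched in a few lines. This is a minimal illustration on hypothetical toy polygons, not the project's plotting code; in a geopandas workflow the same quantities would normally come from the `area` and `length` properties of the dissolved cluster geometries in a projected coordinate system. Multiplying the ratio by 4π gives the Polsby-Popper score, which equals 1 for a circle.

```python
import math

def ring_area_perimeter(coords):
    """Return (area, perimeter) of a simple polygon given as (x, y) vertices."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]        # wrap around to close the ring
        twice_area += x1 * y2 - x2 * y1     # shoelace formula term
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter

def compactness(coords):
    """Area divided by perimeter squared, as used in the bottom-left panel."""
    area, perimeter = ring_area_perimeter(coords)
    return area / perimeter ** 2

# Hypothetical shapes: a unit square and an elongated 4 x 1 strip.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
strip = [(0, 0), (4, 0), (4, 1), (0, 1)]

print(compactness(square))  # 0.0625
print(compactness(strip))   # 0.04
```

The elongated strip scores lower than the square, which is the sense in which irregular or stretched clusters are "less compact."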

### 5.12 Visualization: Individual Socioeconomic Variable Maps

To provide comprehensive context for the clustering analysis, individual choropleth maps were generated for six key socioeconomic variables across all Cook County block groups. These maps show the raw spatial distribution of each variable before clustering, enabling comparison with the regionalization results and revealing the fine-grained patterns that the clustering aggregates.

![SES Variables Maps](output/ses_variables_maps.png)

The six-panel visualization displays poverty rate, median household income, educational attainment (percent with bachelor's degree or higher), unemployment rate, percent White (non-Hispanic), and percent Black (non-Hispanic). Several patterns are immediately apparent. The poverty rate map shows concentrated high-poverty areas on Chicago's South and West Sides, contrasting sharply with low-poverty lakefront and suburban areas. The median household income map mirrors this pattern inversely, with the highest incomes concentrated along the lakefront and in northern suburbs. Educational attainment follows a similar spatial structure, with highly educated populations clustering near universities and affluent neighborhoods. 

The unemployment rate map shows more localized variation, with pockets of high unemployment scattered throughout the county rather than forming large contiguous regions. The racial composition maps reveal Chicago's well-documented residential segregation: percent White is highest in northern suburbs and lakefront areas, while percent Black is concentrated on the South Side and southern suburbs. These individual variable maps demonstrate that the socioeconomic dimensions used in clustering exhibit strong spatial patterning and correlation, providing the foundation for meaningful regionalization. The SKATER clustering effectively synthesizes these correlated patterns into coherent spatial regions.

---

## 6. Discussion of Results and Conclusions

### 6.1 Methodological Strengths

Several aspects of the project execution yielded particularly strong results, demonstrating the viability and effectiveness of the chosen approach. The multi-table Census data integration proved remarkably successful. Despite working with data formatted according to Census Bureau standards (which were not designed with easy programmatic access in mind), the combination of pandas' string manipulation capabilities and careful error handling enabled reliable extraction of all necessary variables. The resulting master dataset was clean, validated, and suitable for downstream analysis.

The spatial weights construction using Queen contiguity operated without complications. The libpysal library is mature and well-tested; the neighborhood network it produced was reasonable and passed all validation checks. This proper foundation for spatial analysis ensured that all subsequent clustering results respected geographic adjacency constraints, a critical requirement for regionalization in planning and policy contexts.

The SKATER algorithm implementation proved robust and stable across all tested cluster counts. The algorithm converged reliably for k=75 through k=90, producing non-overlapping, spatially contiguous clusters in each case. The spopt library implementation handled edge cases (such as islands and regions with few neighbors) gracefully through the specified parameter options. The resulting clustering solutions were semantically interpretable—examining cluster profiles revealed meaningful socioeconomic groupings that aligned with known geographic variation in the Chicago metropolitan area, lending confidence that the algorithm captured real patterns rather than algorithmic artifacts.

The comprehensive evaluation approach—combining quantitative metrics (BSS/TSS), distributional analysis (cluster sizes), statistical summaries (cluster means), and visual inspection (choropleth maps)—provided multiple lenses through which to assess solution quality. This multi-pronged evaluation strategy revealed consistent patterns and enabled robust conclusions about the regionalization structure.

### 6.2 Challenges and Adaptations

Several challenges emerged during execution, requiring pragmatic adaptations and problem-solving.

The first significant challenge involved Census data formatting and the GEOID extraction process. The Census Bureau's GEO_ID field carries a prefix ("1500000US") that encodes the summary level (150, block group) and precedes the actual geographic identifier. Removing this prefix required string manipulation, but it was essential for merging Census data with the TIGER/Line shapefile, which uses plain twelve-digit GEOIDs. Explicit string replacement proved robust and should generalize to other Census datasets that use the same prefix scheme.
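The prefix removal and merge can be sketched as follows. The rows are illustrative only: the GEO_ID strings follow the block-group pattern described above, while the column names `median_income` and the numeric values are placeholders rather than project data.

```python
import pandas as pd

# Illustrative rows only; values are placeholders, not project data.
acs = pd.DataFrame({
    "GEO_ID": ["1500000US170310101001", "1500000US170310101002"],
    "median_income": [50000, 62000],
})
shapes = pd.DataFrame({
    "GEOID": ["170310101001", "170310101002"],
    "ALAND": [100000, 120000],
})

PREFIX = "1500000US"

# Literal (non-regex) replacement strips the prefix and leaves the 12-digit id.
acs["GEOID"] = acs["GEO_ID"].str.replace(PREFIX, "", regex=False)

# Guard against rows that did not match the expected pattern.
assert acs["GEOID"].str.fullmatch(r"\d{12}").all()

# validate= raises if either side has duplicate keys.
merged = shapes.merge(acs, on="GEOID", how="left", validate="one_to_one")
print(merged[["GEOID", "median_income"]])
```

Keeping the identifier as a string matters here: converting it to an integer would drop leading zeros for states whose FIPS code starts with 0.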

A second challenge involved missing data. Some block groups had missing values in key socioeconomic variables, typically because the Census Bureau suppresses estimates when sample sizes are too small. Rather than attempting imputation (which could introduce bias), these approximately 100 block groups were excluded from the clustering analysis. While this reduced the dataset from roughly 2,000 to roughly 1,900 block groups, it ensured that all clustering was based on observed, reported data. Excluding incomplete records in this way is common practice in spatial analysis, and about 95% of block groups were retained.
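A minimal sketch of this exclusion step is shown below. It assumes, as a labeled assumption rather than a description of the project's scripts, that suppressed estimates arrive either as a "-" string or as a large negative sentinel such as -666666666; the column names and values are placeholders.

```python
import pandas as pd

# Placeholder rows; "-" and -666666666 are ASSUMED missing-data markers.
df = pd.DataFrame({
    "GEOID": ["A", "B", "C", "D"],
    "median_income": ["52000", "-", "-666666666", "71000"],
    "pct_poverty": [12.5, 30.1, 8.0, 5.2],
})
ses_cols = ["median_income", "pct_poverty"]

# Non-numeric markers become NaN.
df[ses_cols] = df[ses_cols].apply(pd.to_numeric, errors="coerce")

# Negative sentinels become NaN (safe only because these variables
# cannot legitimately be negative).
df[ses_cols] = df[ses_cols].mask(df[ses_cols] < 0)

# Keep only block groups with complete SES data.
complete = df.dropna(subset=ses_cols)
coverage = len(complete) / len(df)
print(f"kept {len(complete)} of {len(df)} rows ({coverage:.0%})")
```

With the figures reported above, the same calculation gives roughly 1,900 / 2,000 = 95% coverage.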

A third challenge involved column name truncation in the Shapefile format. The DBF (dBASE) file format, which stores attribute tables in Shapefiles, limits field names to ten characters. This resulted in truncated variable names like "pct_white_" instead of "pct_white_nh." Rather than modifying the original data or using awkward naming schemes, the solution involved using substring matching: searching for columns whose names began with expected prefixes (e.g., "pct_white_") to identify the intended variables. This approach was flexible and error-resistant.
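The prefix-matching idea can be sketched as a small helper. The function name `resolve_column` and every column name other than "pct_white_" are hypothetical; the sketch only illustrates matching on the first ten characters of the intended name.

```python
DBF_FIELD_LIMIT = 10  # maximum field-name length in a DBF attribute table

def resolve_column(columns, intended_name):
    """Return the single column matching the truncated form of intended_name."""
    truncated = intended_name[:DBF_FIELD_LIMIT]
    matches = [c for c in columns if c.startswith(truncated)]
    if len(matches) != 1:
        raise KeyError(f"{intended_name!r}: expected one match, found {matches}")
    return matches[0]

# Hypothetical column list as it might look after a Shapefile round trip.
columns = ["GEOID", "pct_white_", "pct_black_", "med_hh_inc", "geometry"]

print(resolve_column(columns, "pct_white_nh"))  # pct_white_
```

Raising when zero or several columns match guards against the main weakness of this approach: two intended names that share their first ten characters would collide after truncation.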

A fourth challenge involved island block groups: geographic units with no neighbors under the Queen adjacency definition. These isolated units occur naturally where water bodies or other features separate them from the rest of the study area. When SKATER was run with islands included, the `islands="increase"` option allowed the algorithm to complete, but in spopt this option treats islands as regions in addition to the requested k clusters, which complicated the interpretation of cluster counts. Explicitly removing islands for the range analysis gave cleaner results and made clear which results included these edge cases and which did not.
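Island detection reduces to finding units with an empty neighbor list. The sketch below uses a plain dictionary as a stand-in for the adjacency structure; with libpysal, the weights object exposes the same information through its `neighbors` and `islands` attributes. The unit ids are hypothetical.

```python
def find_islands(neighbors):
    """Return the ids of units that have no neighbors."""
    return [unit for unit, adjacent in neighbors.items() if not adjacent]

def drop_islands(neighbors):
    """Return a copy of the adjacency mapping without island units."""
    islands = set(find_islands(neighbors))
    return {u: list(adj) for u, adj in neighbors.items() if u not in islands}

# Hypothetical adjacency: unit "D" touches nothing.
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"], "D": []}

print(find_islands(neighbors))        # ['D']
print(sorted(drop_islands(neighbors)))  # ['A', 'B', 'C']
```

Because an island appears in no other unit's neighbor list, removing it leaves the remaining adjacency relations unchanged. The corresponding rows must also be dropped from the attribute table so that the weights and the data stay aligned.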

### 6.3 Project Evaluation

Evaluating the success of a complex analytical project requires examining both technical execution and substantive contributions. From a technical perspective, the pipeline executed successfully, producing valid outputs at each stage. Code was well-documented, modular, and written in a style that facilitates understanding and future modification. The use of established scientific libraries (pandas, geopandas, scikit-learn, libpysal, spopt) rather than custom implementations leveraged peer-reviewed, tested code bases. The analysis was fully reproducible: given the same input data and code, running the pipeline again would produce identical results.

From a methodological perspective, the project adhered to best practices in spatial regionalization research. The SKATER algorithm is recognized in the regional science literature as an appropriate choice for this task. The multi-solution evaluation approach, rather than arbitrarily selecting a single clustering, follows contemporary standards in cluster analysis. The use of both quantitative metrics and qualitative evaluation (examining cluster profiles and visual maps) reflects the principle that good analysis employs multiple forms of evidence.

The substantive findings demonstrate the method's utility. The identified clusters meaningfully group Cook County's block groups according to socioeconomic characteristics, and the resulting regions exhibit the expected structure: affluent suburban and lakefront areas versus economically disadvantaged neighborhoods. This alignment with prior knowledge about Chicago's geography suggests the algorithm is capturing real patterns. The cluster profiles are interpretable and actionable: they could readily inform policy discussions about economic development, service allocation, or equity initiatives targeting particular regions.

However, the project's scope and limitations should be acknowledged. The analysis is static: a snapshot based on the 2016–2020 ACS 5-year estimates. Temporal dynamics, the processes driving spatial change, and historical trajectories are beyond its scope. The set of socioeconomic variables, while comprehensive, was somewhat arbitrary in selection; the analysis does not systematically explore sensitivity to variable choice (though such sensitivity analysis would be a natural extension). Alternative clustering algorithms were not tested for comparison; while SKATER is a well-established choice, examining whether other methods (hierarchical clustering, spectral clustering, etc.) yield substantially different results would strengthen the evidence. The causes underlying the observed clusters remain unexplained. We know that certain neighborhoods cluster together socioeconomically, but understanding why this pattern exists (migration patterns, historical investment decisions, structural economic forces) requires investigation beyond quantitative clustering.

### 6.4 Key Findings and Insights

Several key findings emerged from the analysis. First, spatial clustering is effective for regionalization in this context. The SKATER algorithm identified 75 to 90 distinct spatial clusters, each representing a coherent region of similar socioeconomic character. The fact that geographically adjacent block groups with similar SES characteristics are grouped together (by design) does not undermine the finding that such groupings exist and are meaningful—it simply confirms that socioeconomic variation has a spatial structure worth capturing.

Second, the model fit metric (BSS/TSS) improves monotonically with increasing k, following the expected statistical pattern. However, the improvement diminishes at higher cluster counts, suggesting that k=80 or k=85 represents a sweet spot between fit and parsimony. If forced to recommend a single regionalization for policy use, k=80 would be my choice: it offers better homogeneity than k=75 (explaining 53.5% versus 52.0% of variance, a gain of 1.5 percentage points) while remaining parsimonious.
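For reference, the BSS/TSS ratio can be computed directly from the attribute matrix and the cluster labels. The sketch below is one standard way to do it and is not taken from the project's scripts; the function name `bss_tss_ratio` and the toy data are illustrative, and in practice `X` would be the standardized attribute matrix.

```python
import numpy as np

def bss_tss_ratio(X, labels):
    """Between-group sum of squares divided by total sum of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares: squared deviations from the overall mean.
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    # Within-group sum of squares: squared deviations from each cluster mean.
    wss = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        wss += ((members - members.mean(axis=0)) ** 2).sum()
    # BSS = TSS - WSS.
    return (tss - wss) / tss

# Toy data: two tight groups far apart.
X = [[0.0], [0.0], [10.0], [10.0]]
print(bss_tss_ratio(X, [0, 0, 1, 1]))  # 1.0
print(bss_tss_ratio(X, [0, 1, 0, 1]))  # 0.0
```

A ratio of 1.0 means the clusters explain all variation; 0.0 means the cluster means are indistinguishable from the overall mean. The reported values of 52.0% and 53.5% sit between these extremes.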

Third, the SKATER algorithm with its floor constraint produced balanced cluster sizes, neither producing unwieldy large regions nor fragmenting the county into a thousand tiny units. The mean cluster size of about 21 block groups (consistent with roughly 1,900 block groups divided among 90 clusters, or about 12,000-15,000 people) represents a meaningful geographic scale for policy and planning: large enough to support institutional services, small enough to maintain internal coherence.

Fourth, Cook County exhibits substantial socioeconomic heterogeneity that is effectively captured by the clustering. The range from affluent suburbs to economically disadvantaged neighborhoods is large and spatially distinct. The clustering successfully identifies these tiers and regions in between, providing a useful structure for analysis and policy discussion.

### 6.5 Surprising Findings and Reflections

Several aspects of the results surprised me in productive ways. First, the spatial coherence of the clustering patterns was striking. The choropleth map of the k=90 solution shows geographically coherent regions rather than scattered, fragmented groupings. This coherence came from the contiguity constraint and the algorithm alone, without any explicit spatial smoothing or manual adjustment. Together with the homogeneity of the resulting regions, it suggests that socioeconomic variables are strongly spatially autocorrelated in urban areas: neighborhoods similar to each other tend to be geographically adjacent.

Second, the stability of cluster membership across different k values was noteworthy. When examining which block groups remained together across k=75 through k=90, the relative structure remained stable. The higher-k solutions didn't simply scatter block groups randomly into new clusters; rather, they refined existing groupings, splitting larger regions into more homogeneous sub-regions. This stability suggests the underlying regionalization structure is robust.
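One way to quantify this kind of refinement is a nesting score: for each cluster in the higher-k solution, count how many of its members fall in its single most common lower-k cluster, then divide the total by the number of units. This is a sketch of one possible measure, not the check used in the project; the function name and toy labels are hypothetical.

```python
import pandas as pd

def nesting_score(coarse, fine):
    """Fraction of units lying in the dominant coarse cluster of their fine cluster."""
    table = pd.crosstab(
        pd.Series(fine, name="fine"),
        pd.Series(coarse, name="coarse"),
    )
    # Row maximum = members of a fine cluster inside its dominant coarse cluster.
    return table.max(axis=1).sum() / table.values.sum()

coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine_refined = [0, 0, 1, 1, 2, 2, 3, 3]   # each coarse cluster split in two
fine_shuffled = [0, 1, 0, 1, 0, 1, 0, 1]  # fine clusters straddle both coarse ones

print(nesting_score(coarse, fine_refined))   # 1.0
print(nesting_score(coarse, fine_shuffled))  # 0.5
```

A score near 1.0 indicates that the higher-k solution mostly subdivides existing regions, which is the pattern described above.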

Third, the quality of the Census data was impressive. Once properly formatted, the ACS data proved remarkably complete with very few missing values (only ~100 of ~2,000 block groups), good temporal alignment (all 2020 5-year estimates), and apparent consistency across different tables. This speaks well of Census Bureau data quality and processing.

---

## 7. Future Work and Extensions

### 7.1 Possible Extensions with Additional Resources

If time and resources permitted, the project could be substantially extended in multiple directions.

**Temporal Dynamics.** Obtaining ACS data from multiple years (2016, 2018, 2020, 2022) would enable tracking how regional clusters evolve over time. Are existing clusters stable over time, or are boundaries shifting as neighborhoods transition? Do clusters merge, split, or persist unchanged? Analyzing cluster trajectories could reveal which regions are becoming more integrated socioeconomically and which are diverging. This temporal extension would address an important limitation of the current static analysis.

**Sensitivity and Robustness Analysis.** The current analysis fixes a set of eleven socioeconomic variables and examines clustering across different k values. Systematic sensitivity analysis—varying the variable set, comparing results when key variables are included versus excluded, or using dimensionality reduction (PCA) to combine related variables—would illuminate whether findings are robust or depend heavily on specific methodological choices. Such analysis would strengthen confidence in the conclusions.

**Comparative Algorithm Evaluation.** SKATER is an excellent choice, but it is not the only regionalization approach. Implementing and comparing alternative methods—hierarchical clustering with various linkage criteria, spectral clustering, Louvain community detection, or fuzzy clustering approaches—would reveal whether SKATER's results are representative or reflect particular algorithmic properties. Quantitatively comparing solutions using indices like the Adjusted Rand Index or evaluating stability through jackknife resampling would provide rigorous evidence of relative algorithm performance.

**Optimal k Selection.** While the current analysis examines k=75 through k=90, more sophisticated methods for selecting optimal k could be implemented. These include formal elbow point detection algorithms (fitting curves and detecting breakpoints), information criteria adapted for spatial clustering, gap statistics adapted for spatial data, or stability-based approaches (bootstrap or jackknife resampling to assess how cluster assignments change with small data perturbations). Such formal optimization would provide principled grounds for recommending a single best solution.
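As one example of formal elbow detection, the sketch below picks the k whose point lies farthest above the straight line joining the first and last points of the fit curve. The scores are synthetic placeholders chosen to show a clear bend; they are not the project's BSS/TSS values, and the function name is hypothetical. The method assumes an increasing curve with diminishing gains.

```python
import numpy as np

def elbow_by_chord_distance(ks, scores):
    """Return the k farthest above the chord joining the curve's endpoints."""
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Rescale both axes to [0, 1] so the chord becomes the line y = x.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (scores - scores[0]) / (scores[-1] - scores[0])
    # The vertical gap y - x is proportional to the perpendicular distance.
    gap = y - x
    return int(ks[np.argmax(gap)])

# Synthetic, clearly non-project values with a bend around k=4.
ks = [2, 3, 4, 5, 6, 7, 8]
scores = [0.30, 0.50, 0.62, 0.66, 0.68, 0.69, 0.70]

print(elbow_by_chord_distance(ks, scores))  # 4
```

Because the result depends on the range of k tested, the elbow should be read as a guide within that range rather than a global optimum.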

**Statistical Validation.** The clusters are spatially contiguous by construction, but their socioeconomic homogeneity should be validated through formal statistical testing. ANOVA or Kruskal-Wallis tests could assess whether mean SES variables differ significantly across clusters. Effect sizes and confidence intervals would quantify the magnitude of differences. Post-hoc testing could identify which specific cluster pairs differ significantly.

**Cluster Characterization and Labeling.** Currently, clusters are identified by number. Assigning semantic labels (e.g., "Affluent Suburbs," "Working-Class Urban," "Economically Disadvantaged") based on cluster profiles would make results more interpretable to non-technical audiences. Cluster typologies could be developed, grouping similar clusters into broader categories. Such labeling would facilitate communication with policymakers and community stakeholders.

**Integration with Network Data.** The current analysis uses geographic adjacency to define the spatial weights network. Alternative network structures could be explored: social media ties between neighborhoods, commuting patterns, trade relationships, or cultural similarity networks. Would clusters defined by social networks differ from clusters defined by geographic contiguity? Comparing network structures would illuminate what "adjacency" means in different contexts.

**Policy Applications.** The cluster structure could directly inform policy analysis. Service accessibility—how well are schools, hospitals, transit accessible within each cluster?—could be evaluated. Cluster heterogeneity analysis could identify areas of internal inequality worth targeting with policy. Resource allocation simulations could show how distributing resources equally per capita versus per cluster versus per-need would affect equity. The regionalization could serve as the basis for place-based policy design.

**Interactive Visualization.** A web-based interactive map (using Folium, Plotly, or similar tools) would allow users to explore the results dynamically. Features could include hovering over clusters to see profiles, switching between different k values, querying individual block groups, examining time series, and filtering by various characteristics. Such a tool would be valuable for researchers, policymakers, and community members.

**Methodological Publication.** The systematic comparison of regionalization approaches and formal optimization of k could potentially form the basis of a methodological paper suitable for publication in a spatial analysis or geographic methods journal. Documenting lessons learned, computational efficiency comparisons, and practical guidance for practitioners would contribute to the methodological literature.

### 7.2 Broader Research Agenda

Beyond specific extensions, this project could serve as foundation for a broader research program. The regionalization could support research on multiple fronts: studying how socioeconomic context shapes individual and community outcomes, examining how institutional policies (in education, criminal justice, health care, etc.) differentially affect regions with distinct socioeconomic profiles, analyzing how social networks and information diffusion follow or transcend geographic clusters, or investigating how clusters relate to health disparities, educational outcomes, or political participation. The created dataset, merging Census attributes with geometries, could support numerous downstream projects. The methodology could be adapted to other regions, other time periods, or other variable sets, enabling comparative research across geographic contexts. The analytical pipeline, documented in Jupyter Notebook format, could serve as a template for other researchers executing similar analyses in different contexts.

---

## 8. References

Anselin, L., & Bera, A. K. (1998). Spatial dependence in linear regression models with an introduction to spatial econometrics. *Handbook of Applied Economic Statistics*, 237–289.

Assunção, R. M., Neves, M. C., Câmara, G., & da Costa Freitas, C. (2006). Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees. *International Journal of Geographical Information Science*, 20(7), 797–811. https://doi.org/10.1080/13658810600665111

Duque, J. C., Church, R. L., & Middendorf, R. S. (2011). The p-regions problem. *Geographical Analysis*, 43(1), 104–126. https://doi.org/10.1111/j.1538-4632.2010.00810.x

Duque, J. C., & Rincón-Ruiz, A. (2012). Software for spatial regionalization. *International Regional Science Review*, 35(3), 360–376. https://doi.org/10.1177/0160017611456055

Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). *Geographically Weighted Regression: The Analysis of Spatially Varying Relationships*. Wiley. https://doi.org/10.1002/0470020385

Gaboardi, J. D., et al. (2021). *spopt: Spatial Optimization*. Retrieved from https://spopt.readthedocs.io/

Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer. https://doi.org/10.1007/978-0-387-84858-7

Harris, C. R., et al. (2020). Array programming with NumPy. *Nature*, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. *Computing in Science & Engineering*, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (Vol. 445, pp. 51–56).

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, 12, 2825–2830.

Rey, S. J., et al. (2019). *libpysal: Python Spatial Analysis Library*. Retrieved from https://pysal.org/

U.S. Census Bureau. (2021). *American Community Survey (ACS) 2020 5-Year Data Tables*. Retrieved from https://www.census.gov/programs-surveys/acs/

U.S. Census Bureau. (2021). *TIGER/Line Shapefiles 2020*. Retrieved from https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html

Jordahl, K., et al. (2019). *GeoPandas: Easy Geospatial Analysis in Python*. Retrieved from https://geopandas.org/

GitHub Copilot (2024). AI-assisted code completion tool. GitHub. https://github.com/features/copilot

---

## 9. Appendices

### Appendix A: Code and Implementation

**GitHub Repository:** The complete source code, data processing scripts, and analysis pipeline are publicly available at: https://github.com/rudrark0109/RedefiningCCABoundaries

The analysis pipeline consists of 11 modular Python scripts located in the `scripts/` directory:

1. **01_extract_and_merge_acs.py** – ACS 2020 & TIGER/Line Data Extraction Pipeline
2. **02_build_queen_weights.py** – Building Queen Contiguity Weights
3. **03_skater_range.py** – SKATER Clustering (Range: k=60–90 clusters)
4. **04_cluster_sizes.py** – Region Sizes Analysis
5. **05_cluster_means.py** – Cluster Means Analysis
6. **07_bss_tss_plot.py** – BSS/TSS vs k Plot
7. **10_map_skater_90.py** – SKATER 90 Clusters Map
8. **11_region_size_histogram.py** – Region Size Distribution Histogram
9. **cluster_spatial_analysis.py** – Spatial Cluster Analysis
10. **ses_variables_maps.py** – SES Variables Mapping
11. **skater_vs_ses_comparison.py** – SKATER vs SES Comparison

Each script is self-contained and can be executed independently. All scripts use standard open-source libraries with no custom algorithms or proprietary code.

### Example Code Snippet: SKATER Clustering

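```python
from sklearn.preprocessing import StandardScaler
from spopt.region import Skater

# Assumes `gdf` (GeoDataFrame of block groups), `w` (Queen contiguity weights),
# `attrs_name` (list of SES column names), and `k` (cluster count) are defined.

# Standardize attributes so that no single variable dominates the distances.
# The scaled values are written to a copy so that SKATER actually uses them.
gdf_scaled = gdf.copy()
gdf_scaled[attrs_name] = StandardScaler().fit_transform(gdf[attrs_name].to_numpy())

# Run SKATER with k clusters
model = Skater(
    gdf_scaled,
    w,
    attrs_name=attrs_name,
    n_clusters=k,
    floor=10,            # minimum number of block groups per cluster
    trace=True,          # keep the history of tree cuts
    islands="increase",  # island-handling strategy used in this project
)
model.solve()

# Store the labels on the original (unscaled) GeoDataFrame
gdf["cluster"] = model.labels_
```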

---

### Appendix B: Data Files and Outputs

#### Input Files

**Located in `data/`:**
- `ACSDT5Y2020.B01001-Data.csv` – Age distribution data
- `ACSDT5Y2020.B01003-Data.csv` – Total population
- `ACSDT5Y2020.B03002-Data.csv` – Race and ethnicity
- `ACSDT5Y2020.B15003-Data.csv` – Educational attainment
- `ACSDT5Y2020.B17021-Data.csv` – Poverty status
- `ACSDT5Y2020.B19013-Data.csv` – Median household income
- `ACSDT5Y2020.B23025-Data.csv` – Employment status
- `ACSDT5Y2020.B25003-Data.csv` – Housing tenure

**Located in `shapefiles/`:**
- `tl_2020_17_bg.shp` (with `.dbf`, `.shx`, `.prj`, `.xml` components) – Illinois block group boundaries

#### Output Files

**Geospatial Files (Located in `output/`):**
- `cook_bg_acs2020_ses.shp` (with `.dbf`, `.shx`, `.prj`, `.cpg`) – Integrated ACS attributes and block group geometries
- `cook_bg_skater_60_90.shp` – Multi-solution clustering with columns for k=60, 65, 70, 75, 80, 85, 90

**Tabular Files (Located in `output/`):**
- `cook_bg_acs2020_ses.csv` – Attribute table with all SES variables
- `skater_metrics_60_90.csv` – BSS/TSS quality metrics for each k
- `cluster_sizes_90.csv` – Distribution of block group counts per cluster for k=90
- `cluster_means_90.csv` – Cluster-level socioeconomic profiles for k=90

**Visualization Files (Located in `output/`):**
- `bss_tss_vs_k.png` – Line plot of BSS/TSS model fit across k values
- `map_skater_90.png` – Choropleth map of k=90 regionalization solution
- `cluster_spatial_analysis.png` – Spatial analysis of cluster patterns
- `ses_variables_maps.png` – Maps of individual SES variables
- `skater_vs_ses_comparison.png` – Comparison between SKATER clusters and SES patterns

All outputs are in standard formats (Shapefile, CSV, PNG) readable by mainstream GIS, statistical, and visualization software.

---

### Appendix C: Installation and Setup Instructions

#### Step 1: Install Python
Ensure Python 3.10 or later is installed on your system.

#### Step 2: Create Virtual Environment
```bash
python -m venv osna
```

#### Step 3: Activate Virtual Environment
**Windows (PowerShell):**
```powershell
.\osna\Scripts\Activate.ps1
```

**Mac/Linux:**
```bash
source osna/bin/activate
```

#### Step 4: Install Required Packages
```bash
pip install pandas geopandas numpy scikit-learn libpysal spopt matplotlib
```

#### Step 5: Organize Data Files
- Place ACS CSV files in `data/` directory
- Place TIGER/Line shapefiles in `shapefiles/` directory

#### Step 6: Run Analysis Scripts
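Execute the numbered scripts in numerical order from the `scripts/` directory, followed by the three unnumbered visualization scripts (`cluster_spatial_analysis.py`, `ses_variables_maps.py`, `skater_vs_ses_comparison.py`), which read files produced by the earlier steps. The analysis will generate all results and save them to the `output/` directory.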

All code is deterministic; running the analysis multiple times with the same inputs will produce identical outputs. The complete pipeline requires approximately 5-10 minutes of computation time on standard laptop hardware.
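The run order in Step 6 can be derived from the file names themselves. The helper below is a hypothetical convenience, not part of the repository: it sorts scripts by their numeric prefix and, as an assumption, places the unnumbered visualization scripts last.

```python
import re

def execution_order(script_names):
    """Numbered scripts first (by numeric prefix), then unnumbered ones by name."""
    def sort_key(name):
        match = re.match(r"(\d+)_", name)
        if match:
            return (0, int(match.group(1)), name)
        return (1, 0, name)
    return sorted(script_names, key=sort_key)

# A subset of the script names listed in Appendix A, deliberately shuffled.
scripts = [
    "ses_variables_maps.py",
    "10_map_skater_90.py",
    "02_build_queen_weights.py",
    "01_extract_and_merge_acs.py",
    "cluster_spatial_analysis.py",
    "07_bss_tss_plot.py",
]

for name in execution_order(scripts):
    print(name)
```

Each name could then be passed to `subprocess.run([sys.executable, path], check=True)` so that the pipeline stops at the first failing step.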

---

## Summary and Conclusion

This project successfully applied spatial regionalization techniques to understand the structure of socioeconomic variation within Cook County, Illinois. By integrating American Community Survey Census data with TIGER/Line geographic boundaries, constructing spatial networks through Queen contiguity, and applying the SKATER clustering algorithm across multiple solutions, I identified a set of spatially contiguous regions that are internally homogeneous with respect to socioeconomic indicators.

The analysis demonstrates both technical competence and substantive contribution. From a technical perspective, the project implemented a complete analytical pipeline, from raw Census data through spatial analysis to publication-ready visualizations. From a substantive perspective, the resulting regionalization reveals the underlying geographic structure of Cook County's socioeconomic diversity, offering insights valuable for policy analysis, urban planning, and social science research.

The project also illustrates how to navigate practical challenges in geospatial data science: integrating heterogeneous data sources, handling missing values and data quality issues, implementing complex spatial algorithms, and evaluating results through multiple complementary lenses. These skills and methodologies are generalizable to numerous other geospatial analysis contexts.

The resulting clusters are not merely statistical artifacts but represent meaningful geographic regions sharing similar socioeconomic characteristics. The alignment between algorithmic results and prior knowledge of Chicago's geography provides external validation. The richness of the resulting cluster profiles—distinct patterns of income, education, racial/ethnic composition, and housing tenure—suggests the method successfully captured real underlying structure.

Future work should extend the analysis temporally (examining cluster evolution over years), comparatively (testing alternative algorithms and variable sets), and applicatively (using clusters as the basis for policy analysis and design). The created dataset and methodology will enable numerous downstream research projects examining how geographic clusters relate to health, educational, economic, and social outcomes.

---
