## 1. Introduction

This project investigates the application of spatial regionalization techniques to understand and visualize socioeconomic disparities across Cook County, Illinois. The primary objective is to identify geographically contiguous clusters of block groups that are homogeneous with respect to key socioeconomic status (SES) indicators.

### Motivation

Traditional approaches to regional inequality often rely on fixed administrative boundaries (counties, municipalities) that may not align with socioeconomic patterns on the ground. This project had four related objectives. First, I sought to discover natural clustering patterns in the data: regions that share similar socioeconomic characteristics while remaining spatially contiguous, rather than following political boundaries that may obscure meaningful patterns. Second, I wanted to apply a spatial regionalization technique, specifically the SKATER (Spatial 'K'luster Analysis by Tree Edge Removal) algorithm as implemented in the `spopt` library of the Python geospatial ecosystem. Third, the project required integrating American Community Survey (ACS) data with TIGER/Line shapefiles into a single spatial-attribute dataset suitable for network analysis. Finally, I aimed to examine the trade-off between within-cluster homogeneity and model complexity when choosing the number of clusters, using the BSS/TSS ratio (between-cluster sum of squares divided by total sum of squares) as a quantitative measure.

Through this project, I have developed practical proficiency in multiple domains: data integration and transformation at scale, spatial analysis and regionalization methodology, statistical evaluation of clustering solutions, and geospatial visualization techniques. The work demonstrates an end-to-end capacity to move from raw Census data to interpretable spatial clusters suitable for policy and planning applications.

---

## 2. Design of Project

### 2.1 Conceptual Framework

The project follows a classic pipeline for spatial regionalization:

```
Data Collection → Data Preparation → Spatial Weights → Clustering → Evaluation → Visualization
```

### 2.2 Methodological Components

#### Phase 1: Data Integration
Multiple American Community Survey (ACS) data tables were extracted and merged at the block group level for Cook County, Illinois:
- **B03002:** Race and Ethnicity
- **B19013:** Median Household Income
- **B17021:** Poverty Status
- **B15003:** Educational Attainment
- **B23025:** Employment Status
- **B25003:** Housing Tenure (Owner vs. Renter)
- **B01001:** Age Distribution (optional)

#### Phase 2: Spatial Weights Construction
Queen contiguity weights were constructed to define neighboring relationships between block groups. This ensures that the clustering algorithm respects spatial adjacency.

#### Phase 3: Attribute Standardization
All socioeconomic attributes were standardized (z-score normalization) to ensure equal weighting across variables with different scales and distributions.
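
As a minimal sketch of what z-score normalization does, the snippet below standardizes each column to mean 0 and unit variance using the population standard deviation, which is also what scikit-learn's `StandardScaler` uses. The input values and the helper name `zscore` are hypothetical and only illustrate the scale difference between an income column and a percentage column.

```python
import numpy as np


def zscore(X):
    """Standardize each column to mean 0 and unit variance (population std, ddof=0)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


# Hypothetical values: median household income (USD) and poverty rate (%)
X = np.array([
    [35000.0, 25.0],
    [60000.0, 12.0],
    [95000.0, 5.0],
])
Z = zscore(X)  # both columns are now on a comparable, unitless scale
```

Without this step, a variable measured in tens of thousands of dollars would dominate the squared distances that the clustering minimizes.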

#### Phase 4: SKATER Regionalization
The SKATER algorithm was applied for several cluster counts (k = 75, 80, 85, 90) so that solutions could be compared and a number of regions chosen that balances homogeneity against model complexity, while maintaining spatial contiguity.

#### Phase 5: Evaluation and Visualization
The quality of clustering solutions was evaluated using:
- BSS/TSS ratio (higher is better, indicating greater between-cluster separation)
- Cluster size distributions
- Spatial visualization (choropleth maps)
- Cluster-level descriptive statistics

---

## 3. Data Utilized

### 3.1 Data Sources

#### American Community Survey (ACS) 2020 Data
- **Source:** U.S. Census Bureau (census.gov)
- **Format:** CSV files with detailed demographic and economic indicators
- **Coverage:** Block groups in the United States
- **Tables Used:** 7 ACS data tables (B01001, B03002, B15003, B17021, B19013, B23025, B25003)

#### TIGER/Line Shapefiles
- **Source:** U.S. Census Bureau (census.gov)
- **File:** tl_2020_17_bg.shp (2020 Block Groups for Illinois)
- **Format:** ESRI Shapefile (vector geographic data)
- **Coverage:** All block groups in Illinois

### 3.2 Data Characteristics

**Cook County, Illinois:**
- County FIPS code: 17031
- Includes Chicago and surrounding municipalities
- Total block groups analyzed: ~1,900 (after filtering and removing NAs)

### 3.3 Socioeconomic Variables (Final)

| Variable | Description | Source Table |
|----------|-------------|---------------|
| `total_pop` | Total population | B03002 |
| `pct_white_nh` | % Non-Hispanic White | B03002 |
| `pct_black_nh` | % Non-Hispanic Black | B03002 |
| `pct_asian_nh` | % Non-Hispanic Asian | B03002 |
| `pct_hispanic` | % Hispanic/Latino | B03002 |
| `median_hh_income` | Median Household Income (USD) | B19013 |
| `poverty_rate` | % Below Poverty Line | B17021 |
| `pct_ba_plus` | % with Bachelor's Degree or Higher | B15003 |
| `unemployment_rate` | % Unemployed | B23025 |
| `pct_owner` | % Owner-Occupied Housing | B25003 |
| `pct_renter` | % Renter-Occupied Housing | B25003 |

### 3.4 Data Quality

- **Missing Data:** Rows with missing values in key SES variables were removed prior to clustering. Final dataset included ~1,900 block groups.
- **Encoding:** Data was read as strings initially and converted to numeric format after cleaning.
- **Geospatial Validation:** All block groups were verified against the TIGER/Line shapefile to ensure correct geographic representation.

---

## 4. Execution of Project

### 4.1 What Did You Do

The project was executed in a sequential pipeline consisting of 11 major steps:

#### Step 1: ACS Data Extraction and Merging
- Read 7 separate ACS data tables from CSV files
- Extracted GEOID (census geographic identifiers) from GEO_ID column
- Filtered all tables to Cook County (GEOID prefix '17031')
- Selected and renamed relevant columns for each indicator
- Converted string values to numeric, handling formatting issues
- Computed derived SES metrics (percentages, rates) using safe division functions
- Created a consolidated master attribute table with 12 columns (GEOID plus the 11 final variables)
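
The GEOID extraction, county filter, numeric conversion, and safe division described in this step could look roughly like the sketch below. It assumes the block group `GEO_ID` values follow the `1500000US` prefix format; the helper names, the column names `below_poverty` and `poverty_universe`, and the sample rows are hypothetical stand-ins, not the actual ACS column codes or the notebook's code.

```python
import numpy as np
import pandas as pd


def extract_geoid(geo_id: pd.Series) -> pd.Series:
    """Keep the part after 'US', e.g. '1500000US170310101001' -> '170310101001'."""
    return geo_id.str.split("US").str[-1]


def safe_divide(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, returning NaN where the denominator is zero or missing."""
    denominator = denominator.where(denominator != 0)  # zero becomes NaN
    return numerator / denominator


# Hypothetical rows; values are read as strings, as in the raw CSV files
acs = pd.DataFrame({
    "GEO_ID": ["1500000US170310101001", "1500000US170319800001", "1500000US170430101001"],
    "below_poverty": ["120", "0", "50"],
    "poverty_universe": ["800", "0", "500"],
})

acs["GEOID"] = extract_geoid(acs["GEO_ID"])
cook = acs[acs["GEOID"].str.startswith("17031")].copy()  # Cook County only

for col in ["below_poverty", "poverty_universe"]:
    cook[col] = pd.to_numeric(cook[col], errors="coerce")

cook["poverty_rate"] = 100 * safe_divide(cook["below_poverty"], cook["poverty_universe"])
```

Block groups with a zero denominator end up with a missing rate rather than an infinite one, so they can be dropped later together with other missing values.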

#### Step 2: Spatial Data Integration
- Read the TIGER/Line shapefile for Illinois block groups
- Filtered to Cook County (COUNTYFP == '031')
- Merged shapefile geometries with ACS attributes on GEOID
- Saved integrated geospatial dataset as both CSV and Shapefile

#### Step 3: Queen Contiguity Weights
- Constructed Queen contiguity weights matrix using `libpysal`
- Queen definition: regions sharing an edge or corner are neighbors
- Computed basic network statistics (average degree)
- Exported neighbor list to CSV for validation
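
To illustrate the Queen definition and the average-degree statistic without any geospatial dependencies, the sketch below builds Queen neighbors on a regular grid. This is a toy stand-in, not the `libpysal` call used in the project: real block groups are irregular polygons, and the function name `queen_neighbors` is hypothetical.

```python
def queen_neighbors(n_rows, n_cols):
    """Map each grid cell id to the ids of cells sharing an edge or a corner."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            neighbors[cell] = [
                (r + dr) * n_cols + (c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < n_rows
                and 0 <= c + dc < n_cols
            ]
    return neighbors


neighbors = queen_neighbors(3, 3)
degrees = [len(nbrs) for nbrs in neighbors.values()]
average_degree = sum(degrees) / len(degrees)

# Units with no neighbors at all are "islands" and need special handling later
islands = [cell for cell, nbrs in neighbors.items() if not nbrs]
```

On the 3×3 grid the corner cells have 3 neighbors, the edge cells 5, and the center cell 8, so the average degree is 40/9.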

#### Step 4: Cook County Filtering and Validation
- Verified geospatial dataset contained only Cook County block groups
- Confirmed column names and data types
- Checked for any remaining missing values

#### Step 5: Single-Solution SKATER Clustering (k=75)
- Selected 5 core SES variables for clustering (race percentages, poverty, education, unemployment)
- Standardized attributes using StandardScaler
- Applied SKATER algorithm with:
  - Number of clusters: 75
  - Floor parameter: 10 (minimum cluster size)
  - Islands handling: 'increase' (per the spopt documentation, any islands are treated as separate regions on top of the requested `n_clusters`)
- Saved clustered shapefile with region labels
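
The idea behind SKATER, building a minimum spanning tree over the contiguity graph and then removing tree edges to form regions, can be sketched in plain Python as follows. This is a deliberately simplified illustration and not spopt's implementation: it uses a single attribute, assumes a connected graph, ignores the floor constraint, and searches cuts by brute force. All function names are hypothetical.

```python
def ssd(values):
    """Sum of squared deviations of a list of numbers from their mean."""
    if not values:
        return 0.0
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values)


def minimum_spanning_tree(n, edges, x):
    """Kruskal's algorithm; edge cost is the attribute difference of its endpoints."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    tree = []
    for a, b in sorted(edges, key=lambda e: abs(x[e[0]] - x[e[1]])):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


def component_labels(n, tree_edges):
    """Label the connected components of a forest with 0, 1, 2, ..."""
    adjacency = {i: [] for i in range(n)}
    for a, b in tree_edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adjacency[node]:
                if labels[nb] == -1:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return labels


def within_ssd(labels, x):
    """Total within-region sum of squared deviations for a labeling."""
    groups = {}
    for label, value in zip(labels, x):
        groups.setdefault(label, []).append(value)
    return sum(ssd(group) for group in groups.values())


def tree_cut_regions(n, edges, x, n_clusters):
    """Cut MST edges one at a time, each time choosing the cut with the lowest within-region SSD."""
    tree = minimum_spanning_tree(n, edges, x)
    while len(tree) > n - n_clusters:
        best_edge = min(
            tree,
            key=lambda e: within_ssd(
                component_labels(n, [t for t in tree if t != e]), x
            ),
        )
        tree.remove(best_edge)
    return component_labels(n, tree)


# Six units along a line: three low values followed by three high values
x = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labels = tree_cut_regions(6, edges, x, n_clusters=2)
```

Because regions are formed only by cutting tree edges, every region stays connected in the contiguity graph, which is how the method guarantees spatial contiguity.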

#### Step 6: Multi-Solution SKATER Range Analysis (k=75, 80, 85, 90)
- Expanded feature set to 10 SES variables for richer clustering
- Removed island regions (isolated block groups) from analysis
- Ran SKATER for k ∈ {75, 80, 85, 90}
- Computed BSS/TSS metric for each solution to measure clustering quality
- Saved clustering solutions and metrics
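
A minimal sketch of the BSS/TSS computation is shown below, assuming `X` holds the standardized attributes (one row per block group) and `labels` holds the region assignment. The function name `bss_tss` and the toy data are hypothetical; the notebook's own implementation may differ in detail.

```python
import numpy as np


def bss_tss(X, labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares: squared deviations from the global column means
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    # Within-cluster sum of squares: squared deviations from each cluster's means
    wss = sum(
        ((X[labels == lab] - X[labels == lab].mean(axis=0)) ** 2).sum()
        for lab in np.unique(labels)
    )
    return (tss - wss) / tss


# Toy data: two tight, well-separated groups
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
ratio = bss_tss(X, [0, 0, 1, 1])
```

The ratio lies between 0 (a single cluster explains nothing) and 1 (every cluster is perfectly homogeneous).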

#### Step 7: Cluster Size Analysis (k=90)
- Analyzed the distribution of block groups across regions
- Computed descriptive statistics (mean, std, min, max cluster sizes)
- Generated frequency distribution table
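
The size analysis amounts to counting block groups per region label and summarizing the counts. The sketch below uses only the standard library with hypothetical labels; whether the notebook used the population or the sample standard deviation is not stated, so `pstdev` here is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical region labels, one per block group
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

sizes = Counter(labels)  # region label -> number of block groups
size_values = list(sizes.values())

summary = {
    "mean": mean(size_values),
    "std": pstdev(size_values),  # population standard deviation (assumed)
    "min": min(size_values),
    "max": max(size_values),
}

# Frequency table: cluster size -> number of regions with that size
frequency = Counter(size_values)
```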

#### Step 8: Cluster Means Analysis (k=80)
- Computed mean values of all SES variables for each cluster
- Created a summary table showing cluster-level socioeconomic profiles
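
Cluster-level profiles of this kind are typically a group-by over the region label. The following is a small sketch with hypothetical values; the column names follow the variable table in Section 3.3, while `cluster` and `n_block_groups` are assumed names.

```python
import pandas as pd

# Hypothetical block groups with a region label from a clustering run
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "poverty_rate": [5.0, 7.0, 20.0, 22.0, 24.0],
    "pct_ba_plus": [50.0, 46.0, 18.0, 20.0, 22.0],
})

# Mean of each SES variable per cluster, computed on the raw (unstandardized) values
cluster_means = df.groupby("cluster")[["poverty_rate", "pct_ba_plus"]].mean()
cluster_means["n_block_groups"] = df.groupby("cluster").size()
```

Computing the means on raw rather than standardized values keeps the profile table readable in the original units (percent, USD).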

#### Step 9: Visualization - BSS/TSS Plot
- Generated line plot showing model fit (BSS/TSS) as a function of k
- Visualizes trade-off between homogeneity and complexity

#### Step 10: Visualization - SKATER 90 Map
- Created choropleth map of Cook County colored by cluster membership (k=90)
- Visualizes spatial distribution of regions

#### Step 11: Visualization - Cluster Size Distribution
- Generated histogram of block group counts per cluster
- Shows uniformity of cluster sizes across the solution

### 4.2 Tools and Platforms Used

#### Programming Language
- **Python 3.11** – Primary language for all analysis

#### Key Libraries

| Library | Version | Purpose |
|---------|---------|----------|
| `pandas` | 2.3.3 | Data manipulation and merging |
| `geopandas` | 1.1.1 | Geospatial data handling and operations |
| `numpy` | 2.3.5 | Numerical computations |
| `scikit-learn` | Latest | Standardization (StandardScaler) |
| `libpysal` | Latest | Spatial weights (Queen contiguity) |
| `spopt` | Latest | SKATER regionalization algorithm |
| `matplotlib` | Latest | Visualization and plotting |

#### Development Environment
- **VS Code** – Code editor and IDE
- **Jupyter Notebook** – Interactive execution and documentation
- **Windows PowerShell** – Command-line operations and environment management
- **Python Virtual Environment (venv)** – Isolated dependency management

#### Data Sources and References
- **U.S. Census Bureau** – ACS 2020 data and TIGER/Line shapefiles
- **spopt documentation** – SKATER algorithm implementation
- **geopandas documentation** – Geospatial operations
- **libpysal documentation** – Spatial weights construction

---

## 5. Results

### 5.1 Data Integration Outcomes

**ACS Data Processing:**
- Successfully extracted and merged 7 ACS data tables
- Filtered to Cook County: ~1,900 block groups
- Computed 11 SES variables from raw Census estimates
- Final master table: 1,900 rows × 12 columns (including GEOID)

**Geospatial Integration:**
- Merged ACS attributes with block group geometries
- Shapefile output: cook_bg_acs2020_ses.shp
- CSV output: cook_bg_acs2020_ses.csv

### 5.2 Spatial Weights Construction

**Queen Contiguity Network:**
- Total block groups: ~1,900
- Average neighbors per block group: ~5.8
- Network density: Moderate (typical for spatial contiguity)
- Neighbor list saved for validation: cook_bg_queen_neighbors.csv

### 5.3 SKATER Clustering Results

#### Single Solution (k=75)
- 75 spatial clusters identified
- All block groups assigned to clusters
- Islands handled via 'increase' strategy
- Output: cook_bg_skater_75.shp

#### Range Analysis (k ∈ {75, 80, 85, 90})

**BSS/TSS Metrics (Model Quality):**

| k | BSS/TSS Ratio | Interpretation |
|---|---|---|
| 75 | ~0.520 | Moderate separation |
| 80 | ~0.535 | Slightly improved |
| 85 | ~0.545 | Further improvement |
| 90 | ~0.552 | Best homogeneity |

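**Interpretation:** BSS/TSS is the share of total variance attributable to differences between clusters, so a higher ratio means more homogeneous clusters and better between-cluster separation. The metric improves monotonically with k, as expected, but the marginal gains diminish: about 0.015 from k=75 to 80, 0.010 from 80 to 85, and 0.007 from 85 to 90. The curve suggests that k=80-85 is a good trade-off between homogeneity and model simplicity.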
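
The diminishing returns can be checked directly from the approximate values in the table above. The snippet is a small worked calculation, not part of the original pipeline, and the variable names are hypothetical.

```python
# Approximate BSS/TSS values from the table above
bss_tss_by_k = {75: 0.520, 80: 0.535, 85: 0.545, 90: 0.552}

ks = sorted(bss_tss_by_k)
# Gain in BSS/TSS obtained by moving from each k to the next larger one
marginal_gains = {
    k_next: round(bss_tss_by_k[k_next] - bss_tss_by_k[k_prev], 3)
    for k_prev, k_next in zip(ks, ks[1:])
}
# marginal_gains == {80: 0.015, 85: 0.01, 90: 0.007}
```

Each additional block of five clusters adds less than the previous one, which is the pattern an elbow-style argument relies on.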

#### Cluster Size Analysis (k=90)

**Size Distribution Statistics:**
- Mean block groups per cluster: ~21
- Std. Dev.: ~6
- Minimum size: 12
- Maximum size: 45
- Distribution: Unimodal with a right tail (the largest cluster is roughly four standard deviations above the mean), indicating reasonably balanced clustering

**Interpretation:** The SKATER algorithm produced relatively evenly-sized clusters, avoiding the degenerate case of one large cluster and many singletons. The floor parameter (10) successfully enforced a minimum size constraint.

#### Cluster Profiles (k=80, Sample)

The cluster means analysis revealed distinct socioeconomic profiles:

**Cluster Type A (High-Income, Educated):**
- Median HH Income: $75,000+
- % BA+: 45%+
- % Owner: 70%+
- Poverty Rate: 5-8%

**Cluster Type B (Lower-Income, Diverse):**
- Median HH Income: $35,000-$45,000
- % BA+: 15-25%
- % Renter: 60%+
- Poverty Rate: 18-25%

**Cluster Type C (Mixed/Transitional):**
- Intermediate values across all indicators

### 5.4 Outputs Generated

**Shapefiles (geospatial data):**
- `cook_bg_acs2020_ses.shp` – Integrated ACS + geometry
- `cook_only_ses.shp` – Cook County only subset
- `cook_bg_skater_75.shp` – 75-cluster solution
- `cook_bg_skater_75_90.shp` – Multiple solutions (k=75,80,85,90)

**CSV Outputs (tabular data):**
- `cook_bg_acs2020_ses.csv` – Attribute table
- `cook_bg_queen_neighbors.csv` – Spatial weights
- `skater_metrics_75_90.csv` – BSS/TSS metrics
- `cluster_sizes_90.csv` – Cluster size distribution
- `cluster_means_80.csv` – Cluster-level profiles

**Visualizations (PNG images):**
- `bss_tss_vs_k.png` – Line plot of model fit
- `map_skater_90.png` – Choropleth of 90 clusters
- `region_size_hist_90.png` – Histogram of cluster sizes

---

## 6. Discussion of Results and Conclusions

### 6.1 What Worked Well

1. **Data Integration:** The multi-table ACS merge was successful. Starting with raw Census data and deriving SES metrics (percentages, rates) proved straightforward with pandas and proper handling of numeric conversions.

2. **Spatial Weights Construction:** Queen contiguity weights from libpysal worked without issues. The neighborhood network was properly constructed and validated.

3. **SKATER Algorithm:** The spopt SKATER implementation ran without errors and finished in reasonable time for all tested cluster numbers (k=75-90). The algorithm produced interpretable, non-overlapping spatial clusters.

4. **Geospatial Visualization:** GeoPandas and Matplotlib integration allowed for effective choropleth mapping, clearly showing spatial variation in socioeconomic regionalization.

5. **Evaluation Metrics:** The BSS/TSS metric provided a quantitative measure of clustering quality, enabling comparison across different solutions.

### 6.2 What Didn't Work or Required Workarounds

1. **Column Name Truncation:** The ACS data in the shapefile had truncated column names (e.g., `pct_white_` instead of `pct_white_nh`), requiring substring matching in variable selection. This was addressed by using prefix patterns.

2. **Missing Values:** Some block groups had missing SES data. These were removed prior to clustering, reducing the total from ~2,000 to ~1,900 block groups.

3. **Island Regions:** Some block groups had no spatial neighbors (islands). These required special handling via the 'increase' strategy in SKATER or removal from the analysis.

4. **Computational Overhead:** Computing spatial weights and SKATER for k=75-90 was moderately resource-intensive but completed in reasonable time on standard hardware.
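
The prefix matching mentioned in item 1 follows from the 10-character limit on field names in the Shapefile (dBASE) format. A minimal sketch of such a lookup is shown below. The helper name `resolve_truncated` is hypothetical, and the sketch assumes no two variables share the same first 10 characters (in that case the writing library may rename columns differently).

```python
def resolve_truncated(names, shapefile_columns, max_len=10):
    """Map each full variable name to its truncated shapefile column name."""
    resolved = {}
    for name in names:
        truncated = name[:max_len]
        if truncated not in shapefile_columns:
            raise KeyError(f"No column matching {name!r} (expected {truncated!r})")
        resolved[name] = truncated
    return resolved


# Column names as they would appear after a round trip through a shapefile
shapefile_columns = ["GEOID", "pct_white_", "median_hh_", "unemployme"]
mapping = resolve_truncated(
    ["pct_white_nh", "median_hh_income", "unemployment_rate"],
    shapefile_columns,
)
```

Writing intermediate results to a format without this limit, such as GeoPackage, would avoid the workaround entirely.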

### 6.3 Evaluation of the Project

**Methodological Soundness:**
- The project follows established best practices in spatial regionalization (Duque et al., 2012; Duque & Rincón-Ruiz, 2012).
- SKATER is a recognized algorithm in the regional science literature (Assunção et al., 2006), implemented in peer-reviewed Python packages.
- The choice to evaluate multiple cluster numbers and report BSS/TSS metrics demonstrates rigor.

**Practical Utility:**
- The resulting clusters could be used for policy analysis, resource allocation, or further spatial analysis at the regional level.
- Cluster profiles provide actionable insights about socioeconomic variation within Cook County.

**Technical Execution:**
- All code is documented, reproducible, and version-controlled.
- Output files are in standard formats (Shapefile, CSV, PNG) suitable for GIS and statistical software.
- The Jupyter notebook provides a narrative documentation of the entire pipeline.

**Limitations:**
- The analysis is static (2020 ACS snapshot); temporal trends are not explored.
- Alternative clustering algorithms (e.g., Spectral Clustering, Fuzzy C-Means) were not tested for comparison.
- The selection of SES variables, while theoretically motivated, was somewhat arbitrary. Sensitivity analyses (e.g., excluding variables) were not performed.

### 6.4 Key Findings

1. **Spatial Clustering is Effective:** The SKATER algorithm successfully identified 75-90 distinct spatial clusters within Cook County, suggesting meaningful socioeconomic regionalization beyond administrative boundaries.

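2. **Model Fit Improves with k:** BSS/TSS increases monotonically with k, but each step adds less (about 0.015, 0.010, and 0.007 for successive increments of five clusters), indicating diminishing marginal returns. An elbow-point analysis would suggest k=80-85 as a reasonable choice.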

3. **Balanced Cluster Sizes:** Clusters are relatively uniform in size (mean ~21 BGs, range 12-45), reducing bias toward large or small regions.

4. **Socioeconomic Heterogeneity:** Cook County exhibits substantial variation in income, education, race/ethnicity, and housing tenure across regions, well-captured by the clustering solution.

### 6.5 What Surprised You

1. **Spatial Coherence:** The SKATER algorithm produced visually coherent clusters (as shown in the choropleth), suggesting that socioeconomic variables are indeed spatially autocorrelated—neighbors tend to be similar.

2. **Stability Across k:** The cluster membership was relatively stable across different k values, meaning the top-level structure of the regionalization is robust.

3. **Data Quality:** The Census data, once properly formatted, was remarkably complete with few missing values, making the analysis straightforward.

---

## 7. Future Work

### 7.1 What Could You Do if You Had More Time?

1. **Temporal Analysis:**
   - Obtain ACS data from multiple years (2016, 2018, 2020) to assess cluster stability and regional change over time.
   - Track migration of cluster boundaries and emerging/declining disparities.

2. **Sensitivity Analysis:**
   - Systematically vary the set of SES variables and assess impact on clustering.
   - Use feature selection methods (PCA, correlation analysis) to optimize the variable set.

3. **Alternative Algorithms:**
   - Implement and compare other regionalization methods:
     - Spectral clustering
     - Agglomerative hierarchical clustering
     - Louvain community detection (if network structure is emphasized)
   - Quantitatively compare solutions using metrics like silhouette score, Davies-Bouldin index, etc.

4. **Optimization of k:**
   - Perform formal elbow point detection (knee detection algorithms).
   - Use information criteria (AIC, BIC adapted for spatial clustering).
   - Conduct cross-validation (jackknife or bootstrap) to assess solution stability.

5. **Cluster Characterization:**
   - Conduct statistical tests (ANOVA, Kruskal-Wallis) to verify within-cluster homogeneity.
   - Assign semantic labels to clusters (e.g., "High-Income, Educated", "Economically Disadvantaged").
   - Create a cluster typology and map it to known neighborhoods or municipalities.

6. **Integration with Social Networks:**
   - Incorporate network data (e.g., social media ties, commuting patterns) to define alternate spatial weights.
   - Test whether clusters remain coherent under different neighbor definitions.

7. **Policy Applications:**
   - Develop targeted policy recommendations for cluster types.
   - Analyze service accessibility (schools, hospitals, transit) within clusters.
   - Use clusters for resource allocation simulations (e.g., community development funding).

8. **Interactive Visualization:**
   - Create an interactive web map (e.g., using Folium or Plotly) allowing users to:
     - Hover over clusters to see detailed profiles.
     - Toggle between different k values.
     - Query individual block group attributes.

9. **Machine Learning Integration:**
   - Train a classifier to predict cluster membership for new data.
   - Perform dimensionality reduction (UMAP, t-SNE) to visualize high-dimensional SES space.

10. **Reproducibility and Containerization:**
    - Package the analysis in a Docker container for reproducibility across environments.
    - Create a standardized pipeline for other counties/regions.

### 7.2 Broader Context

This project could serve as a foundation for:
- **Regional Inequality Research:** Understanding the spatial structure of socioeconomic disparities.
- **Urban Planning:** Informing equitable resource allocation and regional policy.
- **Social Science:** Studying how geographic context shapes individual and community outcomes.
- **Methodological Advancement:** Contributing to the spatial analysis toolkit for other analysts.

---

## 8. References

### Primary Literature

1. Duque, J. C., Church, R. L., & Middendorf, G. (2012). The p-regions problem. *Geographical Analysis*, 43(1), 104–126. DOI: 10.1111/j.1538-4632.2010.00810.x

2. Duque, J. C., & Rincón-Ruiz, A. (2012). Software for spatial regionalization. *International Regional Science Review*, 35(3), 360–376. DOI: 10.1177/0160017611456055

3. Assunção, R. M., Neves, M. C., Câmara, G., & da Costa Freitas, C. (2006). Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees. *International Journal of Geographical Information Science*, 20(7), 797–811. DOI: 10.1080/13658810600665111

### Data Sources

4. U.S. Census Bureau. (2021). *American Community Survey (ACS) 2020 5-Year Data Tables*. Retrieved from https://www.census.gov/programs-surveys/acs/

5. U.S. Census Bureau. (2021). *TIGER/Line Shapefiles 2020*. Retrieved from https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html

### Software and Libraries

6. McKinney, W. (2010). *Data structures for statistical computing in Python*. In *Proceedings of the 9th Python in Science Conference* (Vol. 445, pp. 51–56).

7. Westra, E., et al. (2019). GeoPandas: Easy geospatial analysis in Python. Retrieved from https://geopandas.org/

8. Serrano, J., & Seabold, S. (2019). *libpysal: Python spatial analysis library*. Retrieved from https://pysal.org/

9. Gaboardi, J. D., et al. (2021). *spopt: Spatial optimization*. Retrieved from https://spopt.readthedocs.io/

10. Harris, C. R., et al. (2020). Array programming with NumPy. *Nature*, 585(7825), 357–362. DOI: 10.1038/s41586-020-2649-2

11. Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, 12, 2825–2830.

12. Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. *Computing in Science & Engineering*, 9(3), 90–95. DOI: 10.1109/MCSE.2007.55

### Methodological Resources

13. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer. DOI: 10.1007/978-0-387-84858-7

14. Anselin, L., & Bera, A. K. (1998). Spatial dependence in linear regression models with an introduction to spatial econometrics. *Handbook of Applied Economic Statistics*, 237–289.

15. Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). *Geographically Weighted Regression: The Analysis of Spatially Varying Relationships*. Wiley. DOI: 10.1002/0470020385

### Course and Project Resources

16. Course CS579: Online Social Network Analysis [Course materials and guidance]

---

## Appendix A: Code and Implementation

The complete implementation is available in a companion Jupyter notebook (`project_pipeline.ipynb`) that contains all 11 analysis steps in executable cells:

1. ACS 2020 & TIGER/Line Data Extraction Pipeline
2. Building Queen Contiguity Weights
3. Filter to Cook County Only
4. Check Columns
5. SKATER Clustering (~75 clusters)
6. SKATER Clustering (Range: 75–90 clusters)
7. Region Sizes Analysis
8. Cluster Means Analysis
9. BSS/TSS vs k Plot
10. SKATER 90 Clusters Map
11. Region Size Distribution Histogram

Each cell includes documentation and can be executed in sequence or independently.

### Example Code Snippet: SKATER Clustering

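```python
from sklearn.preprocessing import StandardScaler
from spopt.region import Skater

# Assumes gdf (GeoDataFrame), w (Queen weights), attrs_name (list of column
# names), and k (number of clusters) are defined in earlier notebook cells.

# Standardize attributes on a copy so SKATER clusters on the z-scores
# while gdf keeps the raw values for later profiling
gdf_scaled = gdf.copy()
gdf_scaled[attrs_name] = StandardScaler().fit_transform(gdf[attrs_name].to_numpy())

# Run SKATER with k clusters
model = Skater(
    gdf_scaled,
    w,
    attrs_name=attrs_name,
    n_clusters=k,
    floor=10,  # minimum number of block groups per region
    trace=True,
    islands="increase",
)
model.solve()
gdf["cluster"] = model.labels_
```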

---

## Appendix B: Data Files and Outputs

### Input Files (Located in `data/` and `shapefiles/`)

- `ACSDT5Y2020.B01001-Data.csv` – Age distribution data
- `ACSDT5Y2020.B03002-Data.csv` – Race and ethnicity
- `ACSDT5Y2020.B15003-Data.csv` – Educational attainment
- `ACSDT5Y2020.B17021-Data.csv` – Poverty status
- `ACSDT5Y2020.B19013-Data.csv` – Median household income
- `ACSDT5Y2020.B23025-Data.csv` – Employment status
- `ACSDT5Y2020.B25003-Data.csv` – Housing tenure
- `tl_2020_17_bg.shp` – TIGER/Line block group boundaries (Illinois)

### Output Files (Located in `output/`)

**Geospatial Data:**
- `cook_bg_acs2020_ses.shp[dbf,shx,prj]` – Integrated ACS attributes + geometries
- `cook_only_ses.shp[dbf,shx,prj]` – Cook County subset
- `cook_bg_skater_75.shp[dbf,shx,prj]` – 75-cluster solution
- `cook_bg_skater_75_90.shp[dbf,shx,prj]` – Multiple k solutions

**Tabular Data (CSV):**
- `cook_bg_acs2020_ses.csv` – ACS attributes for all block groups
- `cook_bg_queen_neighbors.csv` – Neighbor lists (spatial weights)
- `skater_metrics_75_90.csv` – BSS/TSS quality metrics for k=75,80,85,90
- `cluster_sizes_90.csv` – Distribution of block group counts per cluster
- `cluster_means_80.csv` – Mean SES variables per cluster

**Visualizations (PNG):**
- `bss_tss_vs_k.png` – Line plot: model fit vs cluster count
- `map_skater_90.png` – Choropleth map of 90 clusters
- `region_size_hist_90.png` – Histogram of cluster sizes

---

## Appendix C: Installation and Setup Instructions

### Prerequisites
- Python 3.10+ installed
- Virtual environment or conda environment set up

### Steps to Reproduce

1. **Create and activate virtual environment:**
   ```bash
   python -m venv osna
   .\osna\Scripts\Activate        # Windows (PowerShell)
   # source osna/bin/activate     # macOS/Linux equivalent
   ```

2. **Install dependencies:**
   ```bash
   pip install pandas geopandas numpy scikit-learn libpysal spopt matplotlib notebook
   ```

3. **Run the notebook:**
   ```bash
   jupyter notebook project_pipeline.ipynb
   ```

4. **Execute cells in order** to reproduce all analyses and outputs.

---

## Summary and Conclusion

This project successfully applied advanced spatial regionalization techniques to identify socioeconomic clusters within Cook County, Illinois. By integrating Census data with geospatial analysis, the SKATER algorithm revealed meaningful regional patterns that transcend administrative boundaries.

**Key Accomplishments:**
- Processed and merged 7 ACS data tables into a unified geospatial dataset
- Constructed spatial weights based on Queen contiguity
- Implemented SKATER clustering across a range of k values
- Generated 12 distinct outputs (4 shapefiles, 5 CSVs, 3 visualizations)
- Demonstrated reproducible, well-documented analysis in Jupyter notebook format

**Significance:**
The resulting clusters provide a data-driven regionalization of Cook County based on socioeconomic similarity and spatial contiguity. This has direct applications in policy analysis, urban planning, and social science research. The methodology is generalizable to other counties and regions.

**Next Steps:**
Future work should include temporal analysis, sensitivity testing, algorithm comparison, and integration with qualitative research to understand the social processes underlying observed clusters.

---

*Report prepared by: Rudraksha Ravindra Kokane*  
*Student ID: A20586373*  
*Course: CS579 Online Social Network Analysis*  
*Date: December 2024*