Machine-Learning-Course-Project

The documents, code, and figures uploaded here were produced for the course project only.

Developing a County Level Social Vulnerability Index (SoVI) for the Contiguous United States

To develop the Social Vulnerability Index (SoVI), I downloaded a broad set of demographic and socioeconomic variables from the U.S. Census Bureau (https://data.census.gov/). I normalized these variables as percentages, per capita values, or densities, and then standardized every variable to a z-score (mean = 0, standard deviation = 1) using the equation below. This ensures that variables measured on different scales contribute equally to the downstream analysis.

z = (x - μ) / σ

where x is the observed value, μ is the variable's mean, and σ is its standard deviation.
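As a minimal sketch of this standardization step in Python, the snippet below applies the z-score formula column by column with pandas. The column names and toy values are hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption; the actual processing is in the Colab notebook linked in this repository.

```python
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Return column-wise z-scores: z = (x - mean) / std."""
    # ddof=0 (population standard deviation) is an assumption of this sketch.
    return (df - df.mean()) / df.std(ddof=0)


# Hypothetical toy data standing in for two normalized census variables.
toy = pd.DataFrame(
    {
        "pct_poverty": [10.0, 20.0, 30.0],
        "per_capita_income": [40000.0, 50000.0, 60000.0],
    }
)

z = standardize(toy)
print(z.round(3))
```

After this step every column has mean 0 and standard deviation 1, regardless of its original units.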

This data processing was done in Google Colab; the code is provided in this repository (https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/Project_Machine_Learning_Course.ipynb). I then saved the processed data, also uploaded to the repository (https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/2_Project_Processed_Data.csv), for use in the PCA in RStudio.

I performed a principal components analysis (PCA) in RStudio (code: https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/PCA%20Analysis%20Part%20in%20RStudio.R) on the standardized variables in the 2_Project_Processed_Data.csv dataset, using a varimax rotation to sharpen the definition of each component and applying the Kaiser criterion (eigenvalues > 1) to determine the number of components to retain. I then saved the PCA results, which are uploaded to this repository (https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/3_Project_PCA_Analysis_Data.csv).
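The project's PCA was run in R; the Python sketch below only illustrates the two ideas named above, the Kaiser criterion and a varimax rotation, using NumPy. The function names and the synthetic data are hypothetical, and this plain varimax omits Kaiser normalization, so its output need not match the R script.

```python
import numpy as np


def kaiser_pca(Z):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                          # Kaiser criterion
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, loadings


def varimax(L, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (variables x components) loading matrix."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        grad = Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ grad)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):                 # criterion stopped improving
            break
        d = d_new
    return L @ R


# Synthetic data (an assumption): six variables driven by two latent factors.
rng = np.random.default_rng(0)
f = rng.normal(size=(500, 2))
Z = np.hstack(
    [
        f[:, [0]] + 0.3 * rng.normal(size=(500, 3)),
        f[:, [1]] + 0.3 * rng.normal(size=(500, 3)),
    ]
)

eigvals, loadings = kaiser_pca(Z)
rotated = varimax(loadings)
print("eigenvalues:", eigvals.round(2))
print("rotated loadings:\n", rotated.round(2))
```

Because the rotation is orthogonal, each variable's communality (row sum of squared loadings) is unchanged; only the distribution of loadings across components becomes easier to interpret.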

After that, I loaded 3_Project_PCA_Analysis_Data.csv in the Colab notebook and examined the factor loadings, that is, the correlations between each input variable and each component (heatmap: https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/Heatmap%20of%20PCA%20Loadings.png).

I extracted components to interpret the underlying dimensions of social vulnerability. Variables loading strongly (absolute value > 0.60) on a component guided its labeling (for example, a “Wealth & Economic Status” factor for 5 variables in the table).
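To illustrate the 0.60 loading threshold, the sketch below lists, for each component, the variables whose absolute loading exceeds the cutoff. The loading values, variable names, and component names (`RC1`, `RC2`) are hypothetical and are not taken from the project's results.

```python
import pandas as pd


def strong_loadings(loadings: pd.DataFrame, threshold: float = 0.60) -> dict:
    """Map each component to the variables with |loading| > threshold."""
    return {
        comp: loadings.index[loadings[comp].abs() > threshold].tolist()
        for comp in loadings.columns
    }


# Hypothetical loadings table: rows are input variables, columns are components.
loadings = pd.DataFrame(
    {"RC1": [0.85, -0.72, 0.10], "RC2": [0.05, 0.20, 0.91]},
    index=["per_capita_income", "pct_poverty", "pct_mobile_homes"],
)

strong = strong_loadings(loadings)
print(strong)
```

The absolute value matters here: a strongly negative loading (such as the hypothetical `pct_poverty` on `RC1`) is as informative for labeling a component as a strongly positive one.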

Because some factors inherently reflect decreased vulnerability (such as higher wealth) while others reflect increased vulnerability (such as greater poverty), I applied a directional adjustment (cardinality) to each component. In practice, this meant multiplying “protective” factors by -1 so that higher scores on every adjusted component uniformly indicate greater vulnerability. I exported these adjusted component scores for each county to a separate file named 3_Project_PCA_Analysis_Data.csv, also uploaded to the GitHub repository.
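The directional adjustment can be sketched as a column-wise multiplication by +1 or -1. The component names, scores, and sign assignments below are hypothetical; which components count as "protective" is a judgment made from the loadings, not something this code decides.

```python
import pandas as pd


def apply_cardinality(scores: pd.DataFrame, signs: dict) -> pd.DataFrame:
    """Multiply each component-score column by its +1 / -1 cardinality."""
    missing = set(scores.columns) - set(signs)
    if missing:
        raise ValueError(f"no cardinality given for: {sorted(missing)}")
    sign_series = pd.Series(signs).reindex(scores.columns)
    return scores.mul(sign_series, axis="columns")


# Hypothetical component scores for two counties.
scores = pd.DataFrame(
    {"wealth": [1.2, -0.4], "poverty": [0.3, 0.9]},
    index=["county_a", "county_b"],
)

# -1 flips a protective component so that higher always means more vulnerable.
signs = {"wealth": -1, "poverty": 1}

adjusted = apply_cardinality(scores, signs)
print(adjusted)
```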

Finally, I aggregated the component scores in a simple additive model, summing the cardinality-adjusted factors to compute a single SoVI score per county. I saved the result to the repository (https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/4_Project_PCA_Analysis_Processed.xlsx) for the vulnerability analysis in ArcGIS Pro. To visualize spatial patterns of vulnerability, I mapped these SoVI scores (https://github.com/shamsudduhasami/Machine-Learning-Course-Project/blob/main/SoVI%20Map.jpg) in ArcGIS Pro using a quantile classification with five classes: very low, low, medium, high, and very high social vulnerability. This methodology transforms complex, multi-dimensional census data into an intuitive index that decision makers can use to identify and compare vulnerable communities.

PCA matters here because it reduces the dimensionality of complex socioeconomic data and uncovers latent structure in the vulnerability metrics. In this study, PCA distilled a large number of socio-demographic variables into seven distinct components, each representing a key dimension of vulnerability. The retained components (Wealth & Economic Status, Housing Vulnerability, Social Dependence, Language Barriers & Health Insurance, Health Access, Native American Populations, and Gender) together explained a substantial portion of the variance in the data (comparable to prior SoVI studies capturing about 74% of variance with multiple factors).
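The additive model and the five-class quantile classification described above can be sketched in a few lines of pandas. The data are hypothetical, and `pandas.qcut` is used here only as a stand-in for the quantile classification in ArcGIS Pro, whose class breaks may be computed slightly differently.

```python
import pandas as pd

LABELS = ["very low", "low", "medium", "high", "very high"]


def sovi_score(adjusted: pd.DataFrame) -> pd.Series:
    """Additive model: sum the cardinality-adjusted components per county."""
    return adjusted.sum(axis=1)


def quantile_classes(sovi: pd.Series) -> pd.Series:
    """Assign each county to one of five equal-count (quantile) classes."""
    return pd.qcut(sovi, q=5, labels=LABELS)


# Hypothetical adjusted component scores for ten counties.
adjusted = pd.DataFrame(
    {
        "component_1": [float(i) for i in range(10)],
        "component_2": [0.5] * 10,
    },
    index=[f"county_{i}" for i in range(10)],
)

sovi = sovi_score(adjusted)
classes = quantile_classes(sovi)
print(pd.DataFrame({"sovi": sovi, "class": classes}))
```

An unweighted sum treats every component as equally important, which is the "simple additive model" choice; a weighted sum would be a different design.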

This study addresses a critical research gap in hazard vulnerability assessment: the need for a consistent, multivariate mapping of social vulnerability across the United States. By constructing a county-level Social Vulnerability Index (SoVI) for the contiguous U.S. using principal components analysis (PCA), I provide a standardized, data-driven measure of social vulnerability that is comparable across regions. Previous approaches often focused on single regions or lacked a spatial validation component, so this nationwide PCA-based SoVI offers a unified framework to identify vulnerable counties and examine their spatial clustering. Importantly, I augment the national analysis with a focused examination in Texas, thereby bridging broad-scale vulnerability assessment with detailed state-level insights. This dual-scale approach ensures that the index not only highlights country-wide patterns but is also validated and interpreted in a local context, addressing the gap in consistent multivariate vulnerability mapping from the national down to the state level. By capturing the majority of information in fewer factors, the PCA-based index minimizes redundancy and emphasizes underlying patterns in the data. Interpreting these components provides insight into the drivers of vulnerability.

Mapping the resulting SoVI scores across the contiguous U.S. reveals clear and meaningful spatial patterns. The index highlights pronounced regional disparities. Generally, higher vulnerability scores concentrate in parts of the South and Southwest, whereas lower scores are more common in the Upper Midwest and other northern interior regions. This finding aligns with earlier studies that found clusters of highly vulnerable counties in areas like South Texas and the Mississippi Delta, in contrast to more resilient profiles in many Midwestern communities. Visualizing the SoVI on a map is significant because it provides an intuitive overview of where communities may have uneven capacity for preparedness and response, and where resources might be used most effectively to reduce vulnerability. The map of SoVI scores essentially serves as a diagnostic tool to pinpoint areas with potentially limited ability to cope with hazards. Decision-makers can discern not only local high-risk areas but also how different regions stack up against each other in terms of social vulnerability. Overall, the act of mapping translates the abstract index values into actionable geographic knowledge, highlighting where interventions or additional resources may be most needed.

Cluster Analysis for Texas

After calculating the Social Vulnerability Index (SoVI) scores for all counties, I conducted spatial autocorrelation analyses in ArcGIS Pro to examine clustering patterns in Texas. This two-step approach involved an Incremental Spatial Autocorrelation (ISA) using Global Moran’s I, followed by a Local Moran’s I (Anselin) cluster and outlier analysis at the optimal distance. Applying these spatial analyses to the SoVI results is important for determining whether high or low vulnerability counties tend to cluster together beyond random chance. Identifying significant clusters of social vulnerability can highlight areas that may require targeted policy attention or resources, and Texas was selected for focused analysis because prior studies have indicated the presence of pronounced vulnerability clusters in this state. Texas’s large size and socio-demographic diversity, as well as its frequent exposure to natural hazards, make it an ideal case for investigating spatial patterns of social vulnerability.

Incremental Spatial Autocorrelation (Global Moran’s I)

To determine the scale at which SoVI values exhibit the strongest spatial clustering, I performed an incremental spatial autocorrelation analysis using Global Moran’s I. This tool evaluates spatial autocorrelation (clustering or dispersion) over a range of distances. I specified a fixed number of 30 distance bands, starting at 150,000 meters and increasing in 10,000-meter increments (Euclidean distance with row-standardized weights). At each distance band, Moran’s I and its corresponding z-score were computed to measure the intensity of spatial clustering of county SoVI values. In line with theoretical expectations, the Global Moran’s I z-scores increased with distance initially, indicating intensification of spatial clustering, and then reached a distinct peak before declining. The figure below illustrates the resulting z-scores as a function of distance.
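The analysis itself was run with the ArcGIS Pro tool; the NumPy sketch below only illustrates the idea of recomputing Global Moran's I over a series of fixed distance bands with row-standardized weights. The function names and the toy grid are hypothetical, and the z-score uses the textbook normality-assumption variance, which is an assumption of this sketch and may differ from how ArcGIS Pro computes it.

```python
import numpy as np


def distance_band_weights(coords, threshold):
    """Row-standardized binary weights for points within `threshold` (Euclidean)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=2))
    W = (dist <= threshold).astype(float)
    np.fill_diagonal(W, 0.0)                       # a point is not its own neighbor
    row_sums = W.sum(axis=1, keepdims=True)
    return W / np.where(row_sums > 0, row_sums, 1.0)


def global_morans_i(x, W):
    """Return Moran's I and its z-score under the normality assumption."""
    n = len(x)
    z = x - x.mean()
    S0 = W.sum()
    I = (n / S0) * (z @ W @ z) / (z @ z)
    EI = -1.0 / (n - 1)
    S1 = 0.5 * ((W + W.T) ** 2).sum()
    S2 = ((W.sum(axis=1) + W.sum(axis=0)) ** 2).sum()
    var = (n**2 * S1 - n * S2 + 3 * S0**2) / ((n**2 - 1) * S0**2) - EI**2
    return I, (I - EI) / np.sqrt(var)


def incremental_morans_i(x, coords, start, increment, n_bands):
    """Compute (distance, I, z) for each distance band."""
    results = []
    for k in range(n_bands):
        d = start + k * increment
        I, z = global_morans_i(x, distance_band_weights(coords, d))
        results.append((d, I, z))
    return results


# Hypothetical toy data: a 6 x 6 grid of centroids with a smooth value gradient.
gx, gy = np.meshgrid(np.arange(6.0), np.arange(6.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
values = gx.ravel() + gy.ravel()

results = incremental_morans_i(values, coords, start=1.0, increment=0.5, n_bands=3)
for d, I, z in results:
    print(f"distance={d:.1f}  I={I:.3f}  z={z:.2f}")
```

With real county centroids, the settings described above would correspond to `start=150000`, `increment=10000`, and `n_bands=30` in projected meters, and the band with the highest z-score would be the candidate distance for the local analysis.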

The analysis identified a clear maximum z-score at the 300,000-meter distance band, where the z-score reaches approximately 17.3 (p < 0.0001), after which the z-scores plateau or decrease slightly. This distance (300 km) represents the scale at which spatial processes promoting clustering of SoVI are most pronounced in Texas. In other words, the strongest non-random spatial clustering of social vulnerability among Texas counties occurs at around a 300 km neighborhood size. I selected this distance threshold for subsequent local clustering analysis, as it reflects the spatial scale of the most significant clustering in the data (i.e. the maximum peak identified, rather than the first peak at 170 km, given that clustering continued to intensify up to 300 km).

Global Moran's I Report

The Global Moran’s I test confirms that county‑level SoVI scores in Texas are far from random. With Moran’s I = 0.16 and an exceptionally high z‑score of 17.3 (p < 0.001) using a 300‑km fixed‑distance neighborhood, the analysis indicates strong positive spatial autocorrelation: counties with high social vulnerability are clustered near other high‑vulnerability counties, while low‑vulnerability counties cluster together as well. In practical terms, social vulnerability in Texas forms distinct geographic clusters rather than being evenly or randomly distributed, underscoring the need for place‑focused hazard‑mitigation and resilience strategies.

Local Moran’s I Cluster and Outlier Analysis

Using the 300 km distance as the threshold, I conducted a Local Moran’s I cluster and outlier analysis (Anselin Local Moran’s I) to identify the specific locations of significant clusters or outliers of social vulnerability. This analysis was implemented with a fixed distance band of 300,000 m (Euclidean distance), with row standardization applied to the spatial weights. I used a significance level of 0.05 with no false discovery rate (FDR) correction (i.e. p < 0.05 for significance, unadjusted for multiple comparisons) and employed the default analytical approach for significance (0 permutations, relying on the asymptotic distribution of the Local Moran’s I).

The Local Moran’s I statistic evaluates each county’s SoVI value in the context of its neighbors’ values within the 300 km band. A positive Local Moran’s I indicates that a county has neighbors with similar values (either high-high or low-low), forming a cluster, while a negative Local Moran’s I indicates that a county’s value is dissimilar to its neighbors (either high surrounded by low, or low surrounded by high), identifying it as a spatial outlier. For each county, the analysis yields a cluster/outlier category code: High-High (HH) for a high SoVI county surrounded by high SoVI neighbors, Low-Low (LL) for a low SoVI county surrounded by low SoVI neighbors, High-Low (HL) for a high-vulnerability county surrounded by low-vulnerability neighbors, or Low-High (LH) for a low-vulnerability county with mostly high-vulnerability neighbors. Counties not meeting the 95% confidence level for any pattern are labeled not significant. This local analysis allows me to map and interpret the geography of social vulnerability in Texas, pinpointing not only broad clusters of vulnerability but also anomalous counties that depart from their local context.
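As a rough illustration of how the HH, LL, HL, and LH codes arise, the sketch below computes a Local Moran's I value and a quadrant label for each observation from a row-standardized weights matrix. The toy values and the chain-shaped neighborhood are hypothetical. The sketch deliberately omits the significance test, so it labels every observation, whereas the analysis above keeps only those significant at p < 0.05.

```python
import numpy as np


def local_morans_i(x, W):
    """Return Local Moran's I values and quadrant labels (no significance test)."""
    z = x - x.mean()
    m2 = (z @ z) / len(x)                 # variance-like scaling term
    lag = W @ z                           # weighted mean deviation of neighbors
    I_local = z * lag / m2
    # Values exactly at the mean fall into the "L" side in this sketch.
    labels = np.where(
        z > 0,
        np.where(lag > 0, "HH", "HL"),
        np.where(lag > 0, "LH", "LL"),
    )
    return I_local, labels


# Hypothetical toy data: six locations in a chain, each neighboring the next.
values = np.array([10.0, 9.0, 8.0, 1.0, 2.0, 3.0])
n = len(values)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)         # row standardization

I_local, labels = local_morans_i(values, W)
for v, i_val, lab in zip(values, I_local, labels):
    print(f"value={v:4.1f}  local I={i_val:6.3f}  {lab}")
```

In the toy chain, the third location has a high value but its neighbors average below the mean, so it gets a negative Local Moran's I and an HL label, which is the outlier pattern described above.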

The cluster map of Texas reveals several significant spatial patterns in the distribution of SoVI scores. Most prominently, a large High-High cluster of social vulnerability is evident along the Texas–Mexico border in the southern part of the state. This includes many counties in the Rio Grande Valley and South Texas. These counties all have high SoVI values and are neighboring each other, resulting in a statistically significant cluster of high social vulnerability. This pattern aligns with broader regional trends observed in national studies, which have noted that counties along the Texas–Mexico border tend to exhibit elevated social vulnerability in a clustered fashion. The High-High cluster in South Texas highlights an area where compounding social vulnerabilities (such as poverty, lack of access to resources, and vulnerable population demographics) are spatially concentrated. In contrast, I also identified areas of significantly low vulnerability clustering. Several counties in northern Texas form Low-Low clusters, indicating zones of relative social stability or advantage. These are areas where communities have comparatively lower vulnerability indicators (e.g., higher socioeconomic status, better infrastructure, etc.), and neighboring counties share similarly low vulnerability levels, reinforcing a cluster of low SoVI. The Low-Low clusters signal regions of Texas that are comparatively less vulnerable (potentially more resilient) in the social factors measured by the index. In addition to clusters, the Local Moran’s I analysis highlights specific outlier counties that deviate from their neighbors. A High-Low (HL) outlier denotes a county with a high SoVI value that is surrounded primarily by counties with lower SoVI. Conversely, a Low-High (LH) outlier is a low-vulnerability county bordered by higher-vulnerability neighbors.

