# Test Script: Street Name Geospatial Imputation

**Objective:** This script validates the accuracy of the SAM Addresses shapefile and the **nearest-neighbor join** (`sjoin_nearest`) used to impute missing `location_street_name` values.

**Methodology:**
This test uses a curated list of 10 specific, real-world addresses from across Boston and their corresponding coordinates. For each location, it performs a `gpd.sjoin_nearest` to find the closest address point in the SAM dataset. It then constructs the address from the `STREET_NUM` and `FULL_STREE` columns and compares it to the expected address.
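As a conceptual illustration of what a nearest-neighbor lookup does, the sketch below runs a plain linear scan over a few points in a projected coordinate system. The point labels and coordinates are hypothetical, not SAM data, and `nearest_address` is an illustrative helper; geopandas performs the equivalent lookup through a spatial index rather than a loop like this.

```python
import math

# Hypothetical address points in a projected CRS (x, y in feet); not real SAM data.
address_points = [
    ("10 Example St", 1000.0, 2000.0),
    ("12 Example St", 1030.0, 2000.0),
    ("5 Sample Ave", 1200.0, 2100.0),
]


def nearest_address(x, y, points):
    """Return (label, distance) for the point closest to (x, y)."""
    return min(
        ((label, math.hypot(px - x, py - y)) for label, px, py in points),
        key=lambda pair: pair[1],
    )


label, dist = nearest_address(1025.0, 2003.0, address_points)
print(label, round(dist, 1))
```

Euclidean distance is only meaningful here because the coordinates are projected; the same arithmetic on raw latitude and longitude would distort east-west distances.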

**Expected Outcome:**
A successful test will show that the join consistently finds the correct address or an immediate neighbor. Minor discrepancies due to address ranges (e.g., "15-17" vs. "15") or coordinate precision are expected and acceptable for our project's analytical goals. The key is to confirm that the imputed street name is correct, even when the house number differs slightly.
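To put the coordinate-precision caveat in numbers, the sketch below estimates how far one step in the fourth decimal place of a latitude or longitude value moves a point near Boston. It assumes a spherical Earth with a mean radius of 6,371 km, and `decimal_step_in_metres` is an illustrative helper, so the figures are approximations.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def decimal_step_in_metres(lat_deg, decimals=4):
    """Ground distance covered by one unit in the last decimal place of a coordinate."""
    step_rad = math.radians(10 ** -decimals)
    north_south = EARTH_RADIUS_M * step_rad
    # Lines of longitude converge toward the poles, so east-west steps shrink by cos(latitude).
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


ns, ew = decimal_step_in_metres(42.35)
print(f"{ns:.1f} m north-south, {ew:.1f} m east-west")
```

At this latitude a step of 0.0001 degrees is about 11 m north-south and about 8 m east-west, which is comparable to the width of a narrow urban lot.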

In [25]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
from pathlib import Path

# --- 1. Load the local SAM Addresses shapefile ---
shapefile_path = Path("../data/processed/live_street_address_management_sam_addresses/Live_Street_Address_Management_SAM_Addresses.shp")
print(f"Loading shapefile from: {shapefile_path}")
sam_gdf = gpd.read_file(shapefile_path)

# --- 2. Define 10 specific address points for a more reliable test ---
test_points = {
    'Location': [
        'Egremont Rd Residence', 'Farrington Ave Residence', 'Happy Lamb Brighton', 'Marlborough St Residence', 'Comm Ave Residence', 'Cedar St Residence',
        '7INK Boston', 'Aguadilla St Residence', '345 Harrison Apartments', 'Thrive Medical Care'
    ],
    'Latitude': [
        42.3416, 42.3543, 42.3529, 42.3532, 42.3537, 42.3582, 42.3456, 42.3418, 42.3448, 42.3462
    ],
    'Longitude': [
        -71.1426, -71.1314, -71.1317, -71.0778, -71.0728, -71.0697, -71.0615, -71.0740, -71.0639, -71.0709
    ],
    'Expected Address': [
        '41 Egremont Rd', '15 Farrington Ave', '138 Brighton Ave', '127 Marlborough St', '11 Commonwealth Ave', '20 W Cedar St', '217 Albany St', 
        '6 Aguadilla St', '345 Harrison Ave', '11 Appleton St'
    ]
}
test_df = pd.DataFrame(test_points)

# --- 3. Convert the test data into a GeoDataFrame ---
geometry = [Point(xy) for xy in zip(test_df['Longitude'], test_df['Latitude'])]
test_gdf = gpd.GeoDataFrame(test_df, geometry=geometry, crs="EPSG:4326")

# --- 4. Align CRS to a projected system for accurate distance calculation ---
# EPSG:2249 is NAD83 / Massachusetts Mainland, with coordinates in US survey feet.
test_gdf = test_gdf.to_crs("EPSG:2249")
sam_gdf = sam_gdf.to_crs("EPSG:2249")

# --- 5. Perform the nearest-neighbor join ---
# Equidistant SAM points each produce an output row, so duplicates are dropped in step 6.
address_components = ['STREET_NUM', 'FULL_STREE']
results_gdf = gpd.sjoin_nearest(test_gdf, sam_gdf[address_components + ['geometry']], how="left")

# --- 6. Compare the results and print the report ---
results_gdf['Found Address'] = (
    results_gdf['STREET_NUM'].astype(str) + ' ' + results_gdf['FULL_STREE'].astype(str)
)
results_gdf['Match'] = results_gdf['Expected Address'] == results_gdf['Found Address']

print("\n--- TEST REPORT --- ✅")
results_gdf.drop_duplicates(subset='Location')[['Location', 'Expected Address', 'Found Address', 'Match']]

Loading shapefile from: ../data/processed/live_street_address_management_sam_addresses/Live_Street_Address_Management_SAM_Addresses.shp

--- TEST REPORT --- ✅


| | Location | Expected Address | Found Address | Match |
|---|---|---|---|---|
| 0 | Egremont Rd Residence | 41 Egremont Rd | 41 Egremont Rd | True |
| 1 | Farrington Ave Residence | 15 Farrington Ave | 15-17 Farrington Ave | False |
| 2 | Happy Lamb Brighton | 138 Brighton Ave | 138 Brighton Ave | True |
| 3 | Marlborough St Residence | 127 Marlborough St | 125 Marlborough St | False |
| 4 | Comm Ave Residence | 11 Commonwealth Ave | 11 Commonwealth Ave | True |
| 5 | Cedar St Residence | 20 W Cedar St | 20 W Cedar St | True |
| 6 | 7INK Boston | 217 Albany St | 217 Albany St | True |
| 7 | Aguadilla St Residence | 6 Aguadilla St | 6A Aguadilla St | False |
| 8 | 345 Harrison Apartments | 345 Harrison Ave | 345 Harrison Ave | True |
| 9 | Thrive Medical Care | 11 Appleton St | 11A Appleton St | False |


### Analysis of Imputation Test Results

The results support the nearest-neighbor join as a reliable method for this task. Six of the ten test points match the expected address exactly, and in all ten cases the street name, which is the field being imputed, is correct.

The four `False` rows differ only in the house number and reflect the nuances of real-world address data rather than failures of the method:

* **Address Ranges:** The official city address database sometimes defines a single point for a range of numbers (e.g., finding `15-17 Farrington Ave` for `15 Farrington Ave`). The expected number falls within the returned range.
* **Unit Specificity:** In some cases, the nearest registered address point is for a specific unit (e.g., `6A Aguadilla St` or `11A Appleton St`) rather than the base building number (`6 Aguadilla St`, `11 Appleton St`).
* **Coordinate Precision:** The test coordinates are given to four decimal places, so slight imprecision can match the house next door (e.g., `125 Marlborough St` instead of `127 Marlborough St`).
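The categories above can be checked mechanically instead of by eye. The following is a minimal sketch that classifies each expected/found pair from the test report; the helper names `split_address` and `classify` and the category labels are illustrative, and the parsing assumes every address starts with a house-number token followed by a space.

```python
import re
from collections import Counter

# (expected, found) pairs copied from the test report.
pairs = [
    ("41 Egremont Rd", "41 Egremont Rd"),
    ("15 Farrington Ave", "15-17 Farrington Ave"),
    ("138 Brighton Ave", "138 Brighton Ave"),
    ("127 Marlborough St", "125 Marlborough St"),
    ("11 Commonwealth Ave", "11 Commonwealth Ave"),
    ("20 W Cedar St", "20 W Cedar St"),
    ("217 Albany St", "217 Albany St"),
    ("6 Aguadilla St", "6A Aguadilla St"),
    ("345 Harrison Ave", "345 Harrison Ave"),
    ("11 Appleton St", "11A Appleton St"),
]


def split_address(address):
    """Split '15-17 Farrington Ave' into ('15-17', 'Farrington Ave')."""
    number, _, street = address.partition(" ")
    return number, street


def classify(expected, found):
    exp_num, exp_street = split_address(expected)
    fnd_num, fnd_street = split_address(found)
    if exp_street != fnd_street:
        return "different street"
    if exp_num == fnd_num:
        return "exact"
    # Address range such as '15-17' that covers the expected number.
    span = re.fullmatch(r"(\d+)-(\d+)", fnd_num)
    if span and exp_num.isdigit() and int(span.group(1)) <= int(exp_num) <= int(span.group(2)):
        return "range"
    # Unit suffix such as '6A' on the same base number.
    if re.fullmatch(re.escape(exp_num) + r"[A-Za-z]+", fnd_num):
        return "unit"
    return "same street, other number"


counts = Counter(classify(e, f) for e, f in pairs)
street_matches = sum(split_address(e)[1] == split_address(f)[1] for e, f in pairs)
print(dict(counts))
print(f"street name matches: {street_matches}/{len(pairs)}")
```

On these ten rows the sketch reports six exact matches, one range, two unit suffixes, and one different number on the same street, with the street name matching in all ten.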

### Conclusion: Accuracy is Sufficient

For the purposes of our 311 analysis, this level of accuracy is sufficient. The goal of this imputation is to provide clear, contextual street names for records that were missing them. In this ten-point sample, the method returned the correct street every time and either the exact building or an immediate neighbor. We treat this as a successful validation and will proceed with this imputation method.