# Address Geocoding Fallback: Fuzzy Matching When APIs Fail

## The Challenge

Geocoding services like Google Maps or Mapbox are excellent at converting addresses to coordinates, but they have limitations:

1. **API Rate Limits**: Free tiers often cap requests at 1,000-10,000 per month
2. **Cost at Scale**: Enterprise geocoding can cost $4-5 per 1,000 requests
3. **Network Dependency**: APIs fail during outages or network issues
4. **Data Quality**: User-entered addresses often contain typos, abbreviations, or incomplete information

When a geocoding API returns "address not found" or hits rate limits, you need a fallback strategy. **Fuzzy matching against a local city/ZIP database** can provide approximate coordinates for many addresses that would otherwise fail.

## What You'll Learn

- How to build a local address matching system as a geocoding fallback
- Using BK-trees for efficient city name matching
- Using SchemaIndex for multi-field address matching (city + state + ZIP)
- Handling address abbreviations and variations
- Confidence scoring for fallback results

## Dataset

We'll use the **SimpleMaps US Cities Database** (free tier), which contains:
- ~30,000 US cities and towns
- State, county, latitude, longitude, population
- ZIP codes and timezone information

**Download**: https://simplemaps.com/data/us-cities (free Basic tier)

Place the downloaded `uscities.csv` in the same directory as this notebook.
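
Before loading the real file, it can help to see how a row in this layout is parsed. The snippet below is a minimal sketch that uses two made-up rows (not real SimpleMaps data) and assumes the column names and space-separated `zips` cell that the loader in the next section expects.

```python
import csv
import io

# Two made-up rows in the assumed column layout (illustrative only).
SAMPLE_CSV = (
    '"city","state_id","state_name","county_name","lat","lng","population","density","zips"\n'
    '"Exampleville","ZZ","Example State","Example","40.0000","-90.0000","12345","700","00001 00002 00003"\n'
    '"Emptyton","ZZ","Example State","Example","41.0000","-91.0000","","10",""\n'
)


def preview_rows(text):
    """Parse CSV text and return lightly cleaned records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "city": row["city"],
            "state_id": row["state_id"],
            "lat": float(row["lat"]),
            "lng": float(row["lng"]),
            # An empty population cell becomes 0 instead of raising ValueError.
            "population": int(row["population"]) if row["population"] else 0,
            # Assumption: ZIP codes share one cell, separated by spaces.
            "zips": row["zips"].split(),
        })
    return records


rows = preview_rows(SAMPLE_CSV)
for r in rows:
    print(r["city"], r["state_id"], r["population"], r["zips"])
```

If your download uses different column names or a different ZIP separator, adjust the keys and the `split()` call accordingly.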

In [None]:
import fuzzyrust as fr
import csv
import os
import re
from collections import defaultdict

print("FuzzyRust loaded for address matching")

## 1. Understanding the Address Matching Problem

When users enter addresses, they introduce many variations:

| User Input | Canonical Form | Issue |
|------------|----------------|-------|
| "San Fransisco" | "San Francisco" | Typo |
| "NYC" | "New York" | Abbreviation |
| "St. Louis" | "Saint Louis" | Abbreviated prefix |
| "LA, California" | "Los Angeles, CA" | City nickname + state name |
| "Philly" | "Philadelphia" | Colloquial name |

A robust fallback system must handle all these cases while maintaining high precision to avoid returning wrong coordinates.
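
To see why typos and nicknames need different treatment, it helps to measure them. The sketch below is a plain-Python Levenshtein distance, written for illustration rather than taken from FuzzyRust. It shows that a typo sits one edit away from its canonical form, while a nickname is many edits away and therefore needs an explicit alias table.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # Keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


pairs = [
    ("san fransisco", "san francisco"),  # typo
    ("nyc", "new york"),                 # abbreviation
    ("philly", "philadelphia"),          # colloquial name
]
for user_input, canonical in pairs:
    distance = levenshtein(user_input, canonical)
    print(f"{user_input!r} -> {canonical!r}: distance {distance}")
```

With a typical search radius of 2 edits, only the first pair would be found by edit distance alone.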

## 2. Loading and Preparing City Data

Let's load the SimpleMaps dataset and examine its structure.

In [None]:
def load_city_data(filepath="uscities.csv"):
    """
    Load city data from SimpleMaps CSV.
    
    Expected columns: city, state_id, state_name, county_name, 
                     lat, lng, population, density, zips
    """
    cities = []
    
    if not os.path.exists(filepath):
        print(f"Dataset not found at {filepath}")
        print("Download from: https://simplemaps.com/data/us-cities")
        print("\nUsing sample data for demonstration...")
        return get_sample_cities()
    
    with open(filepath, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                cities.append({
                    'city': row['city'],
                    'state_id': row['state_id'],
                    'state_name': row['state_name'],
                    'county': row.get('county_name', ''),
                    'lat': float(row['lat']) if row['lat'] else None,
                    'lng': float(row['lng']) if row['lng'] else None,
                    'population': int(row['population']) if row.get('population') else 0,
                    'zips': row.get('zips', '').split() if row.get('zips') else []
                })
            except (ValueError, KeyError):
                continue  # Skip rows with missing columns or non-numeric values
    
    return cities


def get_sample_cities():
    """
    Sample dataset for demonstration when full data is unavailable.
    Includes 50 major US cities with realistic data.
    """
    return [
        {'city': 'New York', 'state_id': 'NY', 'state_name': 'New York', 'lat': 40.7128, 'lng': -74.0060, 'population': 8336817, 'zips': ['10001', '10002', '10003']},
        {'city': 'Los Angeles', 'state_id': 'CA', 'state_name': 'California', 'lat': 34.0522, 'lng': -118.2437, 'population': 3979576, 'zips': ['90001', '90002', '90003']},
        {'city': 'Chicago', 'state_id': 'IL', 'state_name': 'Illinois', 'lat': 41.8781, 'lng': -87.6298, 'population': 2693976, 'zips': ['60601', '60602', '60603']},
        {'city': 'Houston', 'state_id': 'TX', 'state_name': 'Texas', 'lat': 29.7604, 'lng': -95.3698, 'population': 2320268, 'zips': ['77001', '77002', '77003']},
        {'city': 'Phoenix', 'state_id': 'AZ', 'state_name': 'Arizona', 'lat': 33.4484, 'lng': -112.0740, 'population': 1680992, 'zips': ['85001', '85002', '85003']},
        {'city': 'Philadelphia', 'state_id': 'PA', 'state_name': 'Pennsylvania', 'lat': 39.9526, 'lng': -75.1652, 'population': 1584064, 'zips': ['19101', '19102', '19103']},
        {'city': 'San Antonio', 'state_id': 'TX', 'state_name': 'Texas', 'lat': 29.4241, 'lng': -98.4936, 'population': 1547253, 'zips': ['78201', '78202', '78203']},
        {'city': 'San Diego', 'state_id': 'CA', 'state_name': 'California', 'lat': 32.7157, 'lng': -117.1611, 'population': 1423851, 'zips': ['92101', '92102', '92103']},
        {'city': 'Dallas', 'state_id': 'TX', 'state_name': 'Texas', 'lat': 32.7767, 'lng': -96.7970, 'population': 1343573, 'zips': ['75201', '75202', '75203']},
        {'city': 'San Jose', 'state_id': 'CA', 'state_name': 'California', 'lat': 37.3382, 'lng': -121.8863, 'population': 1021795, 'zips': ['95101', '95102', '95103']},
        {'city': 'Austin', 'state_id': 'TX', 'state_name': 'Texas', 'lat': 30.2672, 'lng': -97.7431, 'population': 978908, 'zips': ['78701', '78702', '78703']},
        {'city': 'Jacksonville', 'state_id': 'FL', 'state_name': 'Florida', 'lat': 30.3322, 'lng': -81.6557, 'population': 911507, 'zips': ['32099', '32201', '32202']},
        {'city': 'Fort Worth', 'state_id': 'TX', 'state_name': 'Texas', 'lat': 32.7555, 'lng': -97.3308, 'population': 909585, 'zips': ['76101', '76102', '76103']},
        {'city': 'Columbus', 'state_id': 'OH', 'state_name': 'Ohio', 'lat': 39.9612, 'lng': -82.9988, 'population': 898553, 'zips': ['43085', '43201', '43202']},
        {'city': 'San Francisco', 'state_id': 'CA', 'state_name': 'California', 'lat': 37.7749, 'lng': -122.4194, 'population': 881549, 'zips': ['94101', '94102', '94103']},
        {'city': 'Charlotte', 'state_id': 'NC', 'state_name': 'North Carolina', 'lat': 35.2271, 'lng': -80.8431, 'population': 872498, 'zips': ['28201', '28202', '28203']},
        {'city': 'Indianapolis', 'state_id': 'IN', 'state_name': 'Indiana', 'lat': 39.7684, 'lng': -86.1581, 'population': 867125, 'zips': ['46201', '46202', '46203']},
        {'city': 'Seattle', 'state_id': 'WA', 'state_name': 'Washington', 'lat': 47.6062, 'lng': -122.3321, 'population': 744955, 'zips': ['98101', '98102', '98103']},
        {'city': 'Denver', 'state_id': 'CO', 'state_name': 'Colorado', 'lat': 39.7392, 'lng': -104.9903, 'population': 727211, 'zips': ['80201', '80202', '80203']},
        {'city': 'Washington', 'state_id': 'DC', 'state_name': 'District of Columbia', 'lat': 38.9072, 'lng': -77.0369, 'population': 705749, 'zips': ['20001', '20002', '20003']},
        {'city': 'Boston', 'state_id': 'MA', 'state_name': 'Massachusetts', 'lat': 42.3601, 'lng': -71.0589, 'population': 692600, 'zips': ['02101', '02102', '02103']},
        {'city': 'Nashville', 'state_id': 'TN', 'state_name': 'Tennessee', 'lat': 36.1627, 'lng': -86.7816, 'population': 689447, 'zips': ['37201', '37202', '37203']},
        {'city': 'Detroit', 'state_id': 'MI', 'state_name': 'Michigan', 'lat': 42.3314, 'lng': -83.0458, 'population': 670031, 'zips': ['48201', '48202', '48203']},
        {'city': 'Portland', 'state_id': 'OR', 'state_name': 'Oregon', 'lat': 45.5152, 'lng': -122.6784, 'population': 654741, 'zips': ['97201', '97202', '97203']},
        {'city': 'Las Vegas', 'state_id': 'NV', 'state_name': 'Nevada', 'lat': 36.1699, 'lng': -115.1398, 'population': 644644, 'zips': ['89101', '89102', '89103']},
        {'city': 'Memphis', 'state_id': 'TN', 'state_name': 'Tennessee', 'lat': 35.1495, 'lng': -90.0490, 'population': 633104, 'zips': ['38101', '38102', '38103']},
        {'city': 'Louisville', 'state_id': 'KY', 'state_name': 'Kentucky', 'lat': 38.2527, 'lng': -85.7585, 'population': 617638, 'zips': ['40201', '40202', '40203']},
        {'city': 'Baltimore', 'state_id': 'MD', 'state_name': 'Maryland', 'lat': 39.2904, 'lng': -76.6122, 'population': 593490, 'zips': ['21201', '21202', '21203']},
        {'city': 'Milwaukee', 'state_id': 'WI', 'state_name': 'Wisconsin', 'lat': 43.0389, 'lng': -87.9065, 'population': 577222, 'zips': ['53201', '53202', '53203']},
        {'city': 'Albuquerque', 'state_id': 'NM', 'state_name': 'New Mexico', 'lat': 35.0844, 'lng': -106.6504, 'population': 560218, 'zips': ['87101', '87102', '87103']},
        {'city': 'Tucson', 'state_id': 'AZ', 'state_name': 'Arizona', 'lat': 32.2226, 'lng': -110.9747, 'population': 548073, 'zips': ['85701', '85702', '85703']},
        {'city': 'Fresno', 'state_id': 'CA', 'state_name': 'California', 'lat': 36.7378, 'lng': -119.7871, 'population': 531576, 'zips': ['93650', '93701', '93702']},
        {'city': 'Sacramento', 'state_id': 'CA', 'state_name': 'California', 'lat': 38.5816, 'lng': -121.4944, 'population': 513624, 'zips': ['94203', '95814', '95815']},
        {'city': 'Atlanta', 'state_id': 'GA', 'state_name': 'Georgia', 'lat': 33.7490, 'lng': -84.3880, 'population': 506811, 'zips': ['30301', '30302', '30303']},
        {'city': 'Kansas City', 'state_id': 'MO', 'state_name': 'Missouri', 'lat': 39.0997, 'lng': -94.5786, 'population': 495327, 'zips': ['64101', '64102', '64105']},
        {'city': 'Miami', 'state_id': 'FL', 'state_name': 'Florida', 'lat': 25.7617, 'lng': -80.1918, 'population': 467963, 'zips': ['33101', '33102', '33109']},
        {'city': 'Oakland', 'state_id': 'CA', 'state_name': 'California', 'lat': 37.8044, 'lng': -122.2712, 'population': 433031, 'zips': ['94601', '94602', '94603']},
        {'city': 'Minneapolis', 'state_id': 'MN', 'state_name': 'Minnesota', 'lat': 44.9778, 'lng': -93.2650, 'population': 429954, 'zips': ['55401', '55402', '55403']},
        {'city': 'Cleveland', 'state_id': 'OH', 'state_name': 'Ohio', 'lat': 41.4993, 'lng': -81.6944, 'population': 381009, 'zips': ['44101', '44102', '44103']},
        {'city': 'New Orleans', 'state_id': 'LA', 'state_name': 'Louisiana', 'lat': 29.9511, 'lng': -90.0715, 'population': 390144, 'zips': ['70112', '70113', '70114']},
        {'city': 'Tampa', 'state_id': 'FL', 'state_name': 'Florida', 'lat': 27.9506, 'lng': -82.4572, 'population': 399700, 'zips': ['33601', '33602', '33603']},
        {'city': 'Pittsburgh', 'state_id': 'PA', 'state_name': 'Pennsylvania', 'lat': 40.4406, 'lng': -79.9959, 'population': 302971, 'zips': ['15201', '15202', '15203']},
        {'city': 'Cincinnati', 'state_id': 'OH', 'state_name': 'Ohio', 'lat': 39.1031, 'lng': -84.5120, 'population': 309317, 'zips': ['45201', '45202', '45203']},
        {'city': 'Saint Louis', 'state_id': 'MO', 'state_name': 'Missouri', 'lat': 38.6270, 'lng': -90.1994, 'population': 300576, 'zips': ['63101', '63102', '63103']},
        {'city': 'Orlando', 'state_id': 'FL', 'state_name': 'Florida', 'lat': 28.5383, 'lng': -81.3792, 'population': 307573, 'zips': ['32801', '32802', '32803']},
        {'city': 'Saint Paul', 'state_id': 'MN', 'state_name': 'Minnesota', 'lat': 44.9537, 'lng': -93.0900, 'population': 311527, 'zips': ['55101', '55102', '55103']},
        {'city': 'Saint Petersburg', 'state_id': 'FL', 'state_name': 'Florida', 'lat': 27.7676, 'lng': -82.6403, 'population': 265351, 'zips': ['33701', '33702', '33703']},
        {'city': 'Buffalo', 'state_id': 'NY', 'state_name': 'New York', 'lat': 42.8864, 'lng': -78.8784, 'population': 255284, 'zips': ['14201', '14202', '14203']},
        {'city': 'Salt Lake City', 'state_id': 'UT', 'state_name': 'Utah', 'lat': 40.7608, 'lng': -111.8910, 'population': 200567, 'zips': ['84101', '84102', '84103']},
        {'city': 'Honolulu', 'state_id': 'HI', 'state_name': 'Hawaii', 'lat': 21.3069, 'lng': -157.8583, 'population': 350395, 'zips': ['96801', '96813', '96814']},
    ]


# Load data
cities = load_city_data()
print(f"Loaded {len(cities):,} cities")

# Show sample
print("\nSample entries:")
for city in cities[:3]:
    print(f"  {city['city']}, {city['state_id']} - Pop: {city['population']:,} - ZIPs: {city['zips'][:3]}")

## 3. City Name Normalization

Before matching, we need to handle common variations in city names:

- **Prefix abbreviations**: "St." → "Saint", "Ft." → "Fort", "Mt." → "Mount"
- **Nicknames**: "NYC" → "New York", "LA" → "Los Angeles", "Philly" → "Philadelphia"
- **Punctuation**: Remove periods, commas, hyphens, and apostrophes
- **Case**: Normalize to lowercase for comparison

In [None]:
# Common city abbreviation expansions
CITY_PREFIX_EXPANSIONS = {
    'st': 'saint',
    'st.': 'saint',
    'ft': 'fort',
    'ft.': 'fort',
    'mt': 'mount',
    'mt.': 'mount',
    'pt': 'port',
    'pt.': 'port',
    'n': 'north',
    'n.': 'north',
    's': 'south',
    's.': 'south',
    'e': 'east',
    'e.': 'east',
    'w': 'west',
    'w.': 'west',
}

# Common city nicknames and alternate names
CITY_ALIASES = {
    'nyc': 'new york',
    'la': 'los angeles',
    'philly': 'philadelphia',
    'vegas': 'las vegas',
    'nola': 'new orleans',
    'dc': 'washington',
    'frisco': 'san francisco',
    'sf': 'san francisco',
    'atl': 'atlanta',
    'chi-town': 'chicago',
    'the big apple': 'new york',
    'motor city': 'detroit',
    'bean town': 'boston',
    'beantown': 'boston',
}

# State name to abbreviation mapping
STATE_ABBREV = {
    'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR',
    'california': 'CA', 'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE',
    'florida': 'FL', 'georgia': 'GA', 'hawaii': 'HI', 'idaho': 'ID',
    'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA', 'kansas': 'KS',
    'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD',
    'massachusetts': 'MA', 'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS',
    'missouri': 'MO', 'montana': 'MT', 'nebraska': 'NE', 'nevada': 'NV',
    'new hampshire': 'NH', 'new jersey': 'NJ', 'new mexico': 'NM', 'new york': 'NY',
    'north carolina': 'NC', 'north dakota': 'ND', 'ohio': 'OH', 'oklahoma': 'OK',
    'oregon': 'OR', 'pennsylvania': 'PA', 'rhode island': 'RI', 'south carolina': 'SC',
    'south dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT',
    'vermont': 'VT', 'virginia': 'VA', 'washington': 'WA', 'west virginia': 'WV',
    'wisconsin': 'WI', 'wyoming': 'WY', 'district of columbia': 'DC',
}


def normalize_city_name(city: str) -> str:
    """
    Normalize a city name for fuzzy matching.
    
    Steps:
    1. Lowercase and strip surrounding whitespace
    2. Return early if the whole string is a known alias (NYC -> new york)
    3. Remove punctuation (periods, commas, apostrophes, hyphens)
    4. Expand common abbreviations (st -> saint)
    5. Normalize whitespace
    """
    # Lowercase and strip
    normalized = city.lower().strip()
    
    # Check for direct alias match first
    if normalized in CITY_ALIASES:
        return CITY_ALIASES[normalized]
    
    # Remove punctuation except spaces
    normalized = re.sub(r"[.,'-]", '', normalized)
    
    # Split into words and expand prefixes
    words = normalized.split()
    expanded_words = []
    for word in words:
        # Check if word is a prefix abbreviation
        if word in CITY_PREFIX_EXPANSIONS:
            expanded_words.append(CITY_PREFIX_EXPANSIONS[word])
        else:
            expanded_words.append(word)
    
    # Rejoin and normalize whitespace
    normalized = ' '.join(expanded_words)
    
    return normalized


def normalize_state(state: str) -> str:
    """
    Normalize state to 2-letter abbreviation.
    
    Handles: "California" -> "CA", "ca" -> "CA", "CA" -> "CA"
    """
    state_lower = state.lower().strip()
    
    # If it's already an abbreviation (2 chars), uppercase it
    if len(state_lower) == 2:
        return state_lower.upper()
    
    # Look up full name
    return STATE_ABBREV.get(state_lower, state.upper())


# Test normalization
test_cases = [
    "St. Louis",
    "NYC",
    "Ft. Worth",
    "San Fransisco",  # typo
    "LA",
    "Mt. Vernon",
]

print("City Normalization Examples:")
print("-" * 40)
for city in test_cases:
    print(f"{city:20} -> {normalize_city_name(city)}")

## 4. Building a City Matching Index

We'll use two complementary approaches:

1. **BK-Tree**: Fast lookup for edit distance queries ("San Fransisco" → "San Francisco")
2. **Alias Lookup**: Direct mapping for nicknames ("NYC" → "New York")

The BK-tree is ideal for city names because:
- City names are relatively short (1-3 words)
- Typos usually involve 1-2 character edits
- We need exact distance bounds for confidence scoring
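
To make the data structure less of a black box, here is a minimal pure-Python BK-tree sketch. It is an illustration of the general technique, not FuzzyRust's implementation, and the class name `SimpleBKTree` is hypothetical. Each child edge is labeled with its distance to the parent, and the triangle inequality lets a search skip any subtree whose label falls outside `[d - max_distance, d + max_distance]`.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance used as the tree's metric."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class SimpleBKTree:
    """Illustrative BK-tree: each node is (word, {edge_distance: child_node})."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # Word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child  # Descend along the edge with the same distance

    def search(self, query: str, max_distance: int) -> list:
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_distance:
                results.append((d, word))
            # Triangle inequality: matches can only live under edges in this range.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


words = ["san francisco", "san antonio", "san diego", "san jose",
         "seattle", "houston", "boston"]
tree = SimpleBKTree()
for w in words:
    tree.add(w)

print(tree.search("san fransisco", 2))
```

The pruning step is what makes the search faster than comparing the query against every city name, and the benefit grows as the search radius gets smaller relative to typical name lengths.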

In [None]:
class CityMatcher:
    """
    Fuzzy city name matcher using BK-tree and alias expansion.
    
    Strategy:
    1. First check alias lookup (instant, handles nicknames)
    2. Then query BK-tree for edit distance matches
    3. Optionally filter by state for disambiguation
    """
    
    def __init__(self, cities: list):
        self.cities = cities
        self.bk_tree = fr.BkTree()
        
        # Build lookup structures
        self.city_by_normalized = {}  # normalized_name -> [city_records]
        self.city_by_state = defaultdict(list)  # state -> [city_records]
        
        print("Building city index...")
        for city in cities:
            normalized = normalize_city_name(city['city'])
            
            # Add to BK-tree
            self.bk_tree.add(normalized)
            
            # Build lookup tables
            if normalized not in self.city_by_normalized:
                self.city_by_normalized[normalized] = []
            self.city_by_normalized[normalized].append(city)
            
            self.city_by_state[city['state_id']].append(city)
        
        print(f"Indexed {len(self.city_by_normalized):,} unique city names")
        print(f"BK-tree contains {len(self.bk_tree):,} entries")
    
    def find_city(self, query: str, state: str = None, max_distance: int = 2) -> list:
        """
        Find matching cities for a query.
        
        Args:
            query: City name (may contain typos or abbreviations)
            state: Optional state filter (name or abbreviation)
            max_distance: Maximum edit distance for fuzzy matching
        
        Returns:
            List of (city_record, confidence_score) tuples, sorted by confidence
        """
        normalized_query = normalize_city_name(query)
        state_filter = normalize_state(state) if state else None
        
        results = []
        
        # Step 1: Check for exact match (after normalization)
        if normalized_query in self.city_by_normalized:
            for city in self.city_by_normalized[normalized_query]:
                if state_filter is None or city['state_id'] == state_filter:
                    results.append((city, 1.0))  # Perfect confidence
        
        # Step 2: BK-tree fuzzy search
        if not results:  # Only if no exact match
            matches = self.bk_tree.search(normalized_query, max_distance)
            
            for match in matches:
                matched_name = match.text
                distance = match.distance
                
                # Calculate confidence based on edit distance and string length
                # Longer strings with same edit distance = higher confidence
                max_len = max(len(normalized_query), len(matched_name))
                confidence = 1.0 - (distance / max_len)
                
                for city in self.city_by_normalized.get(matched_name, []):
                    if state_filter is None or city['state_id'] == state_filter:
                        results.append((city, confidence))
        
        # Sort by confidence (descending), then by population (descending) for ties
        results.sort(key=lambda x: (-x[1], -x[0].get('population', 0)))
        
        return results
    
    def get_coordinates(self, query: str, state: str = None, min_confidence: float = 0.7) -> dict:
        """
        Get coordinates for a city query.
        
        Returns:
            Dict with lat, lng, city, state, confidence, or None if no match
        """
        results = self.find_city(query, state)
        
        if not results:
            return None
        
        best_city, confidence = results[0]
        
        if confidence < min_confidence:
            return None
        
        return {
            'city': best_city['city'],
            'state': best_city['state_id'],
            'lat': best_city['lat'],
            'lng': best_city['lng'],
            'population': best_city.get('population', 0),
            'confidence': confidence,
            'source': 'fuzzy_match'
        }


# Build the matcher
city_matcher = CityMatcher(cities)

## 5. Testing City Matching

Let's test the matcher with various query types:

- **Typos**: "San Fransisco", "Philidelphia"
- **Abbreviations**: "St. Louis", "Ft. Worth"
- **Nicknames**: "NYC", "LA", "Philly"
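- **Ambiguous**: "Portland" (OR vs ME), "Columbus" (OH vs GA). The built-in sample data contains only Portland, OR and Columbus, OH, so the alternatives appear only with the full dataset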
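
When reading the results, it helps to know what the confidence numbers mean. `find_city` scores a fuzzy hit as `1.0 - distance / max_len`, so the same single-character typo costs more on a short name than on a long one. The sketch below reproduces that arithmetic; the edit distances are worked out by hand for these pairs, and the helper names are hypothetical.

```python
def length_normalized_confidence(distance: int, query: str, match: str) -> float:
    """Same formula as find_city: 1 - distance / length of the longer string."""
    max_len = max(len(query), len(match))
    if max_len == 0:
        return 0.0  # Guard against empty input
    return 1.0 - distance / max_len


def min_length_for(distance: int, min_confidence: float = 0.7) -> int:
    """Shortest string length at which a given edit distance still passes."""
    length = max(distance, 1)
    while 1.0 - distance / length < min_confidence:
        length += 1
    return length


# (query, match, hand-computed edit distance)
examples = [
    ("san fransisco", "san francisco", 1),
    ("philidelphia", "philadelphia", 1),
    ("seatle", "seattle", 1),
    ("huston", "houston", 1),
]
for query, match, distance in examples:
    score = length_normalized_confidence(distance, query, match)
    print(f"{query:<15} -> {match:<15} {score:.3f}")

for d in (1, 2):
    print(f"distance {d} passes the 0.7 threshold from length {min_length_for(d)}")
```

One consequence of this formula: with the default `min_confidence` of 0.7, a match two edits away is accepted only when the longer string has at least 7 characters, so short city names effectively get a stricter search radius.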

In [None]:
# Test queries
test_queries = [
    # Typos
    ("San Fransisco", None),
    ("Philidelphia", None),
    ("Seatle", None),
    ("Huston", "TX"),
    
    # Abbreviations and prefixes
    ("St. Louis", None),
    ("Ft. Worth", None),
    
    # Nicknames
    ("NYC", None),
    ("LA", "CA"),
    ("Philly", None),
    ("Vegas", None),
    
    # Ambiguous (multiple states)
    ("Portland", None),
    ("Portland", "OR"),
    ("Portland", "Maine"),
    ("Columbus", None),
    ("Columbus", "OH"),
]

print(f"{'Query':<20} {'State':<8} {'Matched City':<20} {'State':<6} {'Confidence':<10}")
print("=" * 70)

for query, state in test_queries:
    result = city_matcher.get_coordinates(query, state)
    
    if result:
        print(f"{query:<20} {state or '-':<8} {result['city']:<20} {result['state']:<6} {result['confidence']:.2%}")
    else:
        print(f"{query:<20} {state or '-':<8} {'NO MATCH':<20}")

## 6. Multi-Field Address Matching with SchemaIndex

For more complex address matching, we can use SchemaIndex to match on multiple fields simultaneously:

- **City**: Fuzzy match with abbreviation handling
- **State**: Exact or fuzzy match
- **ZIP**: Exact or partial match

This is especially useful when we have partial information (e.g., city + ZIP but no state).
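
One way to build intuition for field weights is to combine per-field scores by hand. The sketch below assumes a weighted average over the fields present in the query. That is an assumption made for illustration, and SchemaIndex may combine scores differently. The weights (city 10, state 5, ZIP 8) match the ones used in the next cell, and the function name is hypothetical.

```python
FIELD_WEIGHTS = {"city": 10, "state": 5, "zips": 8}


def combine_field_scores(field_scores: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average over only the fields that were queried (assumed scheme)."""
    total_weight = sum(weights[field] for field in field_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[field] * score for field, score in field_scores.items())
    return weighted_sum / total_weight


# A slightly misspelled city with the right state vs. the wrong state.
right_state = combine_field_scores({"city": 0.95, "state": 1.0})
wrong_state = combine_field_scores({"city": 0.95, "state": 0.0})

print(f"city 0.95 + matching state:    {right_state:.3f}")
print(f"city 0.95 + mismatching state: {wrong_state:.3f}")
```

Under this scheme a state mismatch pulls a strong city match below a 0.7 cutoff, which is the disambiguation effect the state field is meant to provide.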

In [None]:
class AddressMatcher:
    """
    Multi-field address matcher using SchemaIndex.
    
    Matches on city, state, and ZIP code with configurable weights.
    Handles partial queries (e.g., just city + state, or city + ZIP).
    """
    
    def __init__(self, cities: list):
        self.cities = cities
        
        # Build schema: city (fuzzy), state (short), zips (tokens)
        builder = fr.SchemaBuilder()
        builder.add_field(
            "city", 
            "short_text",
            weight=10,  # City is most important
            algorithm="jaro_winkler",  # Good for names
            normalize="lowercase",
            required=True
        )
        builder.add_field(
            "state",
            "short_text",
            weight=5,  # State helps disambiguate
            algorithm="exact_match",  # States should match exactly
            normalize="lowercase"
        )
        builder.add_field(
            "zips",
            "token_set",
            weight=8,  # ZIP is highly discriminative
            separator=" "
        )
        
        schema = builder.build()
        self.index = fr.SchemaIndex(schema)
        
        # Add cities to index
        print("Building multi-field address index...")
        for i, city in enumerate(cities):
            # Normalize city name before indexing
            normalized_city = normalize_city_name(city['city'])
            
            self.index.add({
                'city': normalized_city,
                'state': city['state_id'].lower(),
                'zips': ' '.join(city.get('zips', [])[:10])  # Limit ZIPs per city
            }, data=i)  # Store index for lookup
        
        print(f"Indexed {len(self.index)} addresses")
    
    def search(self, city: str = None, state: str = None, zip_code: str = None,
               min_score: float = 0.5, limit: int = 5) -> list:
        """
        Search for addresses matching the given criteria.
        
        Args:
            city: City name (fuzzy matched)
            state: State name or abbreviation
            zip_code: ZIP code (can be partial, e.g., "941" matches "94101")
            min_score: Minimum match score (0-1)
            limit: Maximum results to return
        
        Returns:
            List of matching addresses with scores
        """
        query = {}
        
        if city:
            query['city'] = normalize_city_name(city)
        if state:
            query['state'] = normalize_state(state).lower()
        if zip_code:
            query['zips'] = zip_code
        
        if not query:
            return []
        
        results = self.index.search(query, min_score=min_score, limit=limit)
        
        # Enrich results with full city data
        enriched = []
        for r in results:
            city_data = self.cities[r.data]
            enriched.append({
                'city': city_data['city'],
                'state': city_data['state_id'],
                'lat': city_data['lat'],
                'lng': city_data['lng'],
                'population': city_data.get('population', 0),
                'score': r.score,
                'field_scores': r.field_scores
            })
        
        return enriched


# Build the address matcher
address_matcher = AddressMatcher(cities)

In [None]:
# Test multi-field searches
print("Multi-Field Address Matching Examples:")
print("=" * 80)

test_cases = [
    # City only (ambiguous)
    {"city": "Portland"},
    
    # City + State (disambiguated)
    {"city": "Portland", "state": "Oregon"},
    
    # City with typo + ZIP
    {"city": "San Fransisco", "zip_code": "94102"},
    
    # Just ZIP code
    {"zip_code": "10001"},
    
    # Nickname + state
    {"city": "Philly", "state": "PA"},
]

for query in test_cases:
    print(f"\nQuery: {query}")
    results = address_matcher.search(**query, limit=3)
    
    if results:
        for r in results:
            print(f"  -> {r['city']}, {r['state']} (Score: {r['score']:.2%})")
            print(f"     Field scores: {r['field_scores']}")
    else:
        print("  -> No matches found")

## 7. Building a Geocoding Fallback System

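Now let's build the fallback half of a geocoding pipeline. In production, the overall flow would:

1. Attempt primary geocoding through the API
2. Fall back to fuzzy matching when the API fails
3. Return coordinates with confidence scores
4. Handle various input formats

The `GeocodingFallback` class in the next cell covers steps 2-4; the primary API call is left to your own client code.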
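
As a rough illustration of how the primary call and the fallback might be wired together, here is a minimal sketch. Everything in it is hypothetical: `PrimaryGeocoderError`, `simulated_primary`, and `toy_fallback` are stand-ins, not part of any real geocoding client. In this notebook you would pass `geocoder.geocode` as the fallback instead of the toy lookup.

```python
from typing import Callable, Optional


class PrimaryGeocoderError(Exception):
    """Hypothetical error covering rate limits, outages, and 'address not found'."""


def simulated_primary(address: str) -> Optional[dict]:
    """Stand-in for a real API client; always fails as if rate limited."""
    raise PrimaryGeocoderError("rate limit exceeded (simulated)")


def toy_fallback(address: str) -> Optional[dict]:
    """Tiny local lookup standing in for the fuzzy matcher."""
    known = {"san francisco, ca": (37.7749, -122.4194)}
    coords = known.get(address.strip().lower())
    if coords is None:
        return None
    return {"lat": coords[0], "lng": coords[1],
            "confidence": 1.0, "source": "local_fallback"}


def geocode_with_fallback(
    address: str,
    primary: Callable[[str], Optional[dict]],
    fallback: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Try the primary geocoder; use the fallback if it fails or finds nothing."""
    try:
        result = primary(address)
        if result is not None:
            return {**result, "source": "primary_api"}
    except PrimaryGeocoderError:
        pass  # Fall through to the local matcher
    return fallback(address)


print(geocode_with_fallback("San Francisco, CA", simulated_primary, toy_fallback))
print(geocode_with_fallback("Nowhere, ZZ", simulated_primary, toy_fallback))
```

Tagging each result with a `source` makes it easy to audit later how many coordinates came from the API and how many from the local matcher.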

In [None]:
def parse_address_components(address: str) -> dict:
    """
    Parse a free-form address into components.
    
    Handles formats like:
    - "San Francisco, CA"
    - "New York, New York 10001"
    - "Austin TX"
    - "90210" (ZIP only)
    """
    components = {
        'city': None,
        'state': None,
        'zip_code': None
    }
    
    # Clean the address
    address = address.strip()
    
    # Extract ZIP code (5 digits or 5+4 format)
    zip_match = re.search(r'\b(\d{5})(?:-\d{4})?\b', address)
    if zip_match:
        components['zip_code'] = zip_match.group(1)
        address = address[:zip_match.start()] + address[zip_match.end():]
    
    # If only ZIP was provided, return early
    address = address.strip(' ,')
    if not address:
        return components
    
    # Split by comma or common separators
    parts = re.split(r'[,]+', address)
    parts = [p.strip() for p in parts if p.strip()]
    
    if len(parts) >= 2:
        # "City, State" format
        components['city'] = parts[0]
        # State might have extra words, take last 1-2 word token
        state_part = parts[-1].strip()
        state_words = state_part.split()
        if len(state_words[-1]) == 2:  # Likely abbreviation
            components['state'] = state_words[-1]
        else:
            components['state'] = state_part
    elif len(parts) == 1:
        # Single part - could be "City State" or just "City"
        words = parts[0].split()
        # Require a real state code so names like "Santa Fe" stay intact
        if len(words) >= 2 and words[-1].upper() in STATE_ABBREV.values():
            components['city'] = ' '.join(words[:-1])
            components['state'] = words[-1]
        else:
            # Assume it's all city name
            components['city'] = parts[0]
    
    return components


class GeocodingFallback:
    """
    Geocoding system with fuzzy matching fallback.
    """
    
    def __init__(self, cities: list):
        self.city_matcher = CityMatcher(cities)
        self.address_matcher = AddressMatcher(cities)
    
    def geocode(self, address: str, min_confidence: float = 0.7) -> dict:
        """
        Geocode an address string.
        
        Returns:
            Dict with lat, lng, city, state, confidence, source
            or None if no match found
        """
        # Parse address components
        components = parse_address_components(address)
        
        # Strategy 1: Multi-field search if we have multiple components
        if sum(1 for v in components.values() if v) >= 2:
            results = self.address_matcher.search(
                city=components['city'],
                state=components['state'],
                zip_code=components['zip_code'],
                min_score=min_confidence,
                limit=1
            )
            
            if results:
                r = results[0]
                return {
                    'city': r['city'],
                    'state': r['state'],
                    'lat': r['lat'],
                    'lng': r['lng'],
                    'confidence': r['score'],
                    'source': 'multi_field_match',
                    'field_scores': r['field_scores']
                }
        
        # Strategy 2: Simple city lookup
        if components['city']:
            result = self.city_matcher.get_coordinates(
                components['city'],
                components['state'],
                min_confidence
            )
            if result:
                return result
        
        # Strategy 3: ZIP-only lookup
        if components['zip_code']:
            results = self.address_matcher.search(
                zip_code=components['zip_code'],
                min_score=0.5,
                limit=1
            )
            if results:
                r = results[0]
                return {
                    'city': r['city'],
                    'state': r['state'],
                    'lat': r['lat'],
                    'lng': r['lng'],
                    'confidence': r['score'] * 0.8,  # Lower confidence for ZIP-only
                    'source': 'zip_lookup'
                }
        
        return None


# Initialize the fallback system
geocoder = GeocodingFallback(cities)

In [None]:
# Test the complete geocoding fallback system
test_addresses = [
    # Standard format
    "San Francisco, CA",
    "New York, NY 10001",
    
    # Typos
    "San Fransisco, California",
    "Seatle WA",
    
    # Abbreviations and nicknames
    "St. Louis, MO",
    "NYC",
    "Philly, PA",
    
    # Partial information
    "Portland",  # Ambiguous
    "Portland, OR",  # Disambiguated
    "94102",  # ZIP only
    
    # Unusual formats
    "Austin TX 78701",
    "Las Vegas, Nevada",
]

print(f"{'Address':<35} {'Result':<25} {'Confidence':<12} {'Source'}")
print("=" * 90)

for addr in test_addresses:
    result = geocoder.geocode(addr)
    
    if result:
        location = f"{result['city']}, {result['state']}"
        print(f"{addr:<35} {location:<25} {result['confidence']:.1%}          {result['source']}")
    else:
        print(f"{addr:<35} {'NO MATCH':<25}")

## 8. Confidence Scoring and Thresholds

In a production geocoding system, confidence scores help you decide:

- **High confidence (≥0.9)**: Use coordinates directly
- **Medium confidence (0.7 to <0.9)**: Use, but flag for review
- **Low confidence (0.5 to <0.7)**: Prompt the user to confirm
- **Very low (<0.5)**: Reject and ask for clarification

Let's build a function that categorizes results:

In [None]:
def geocode_with_quality(geocoder: GeocodingFallback, address: str) -> dict:
    """
    Geocode with quality assessment.
    
    Returns the result with a quality rating:
    - HIGH: confidence >= 0.9, can use automatically
    - MEDIUM: 0.7 <= confidence < 0.9, use with caution
    - LOW: 0.5 <= confidence < 0.7, needs verification
    - FAILED: confidence < 0.5 or no match
    """
    result = geocoder.geocode(address, min_confidence=0.5)
    
    if not result:
        return {
            'address': address,
            'quality': 'FAILED',
            'message': 'No match found',
            'result': None
        }
    
    confidence = result['confidence']
    
    if confidence >= 0.9:
        quality = 'HIGH'
        message = 'Confident match, safe to use automatically'
    elif confidence >= 0.7:
        quality = 'MEDIUM'
        message = 'Probable match, consider verification for critical use'
    elif confidence >= 0.5:
        quality = 'LOW'
        message = 'Uncertain match, manual verification recommended'
    else:
        quality = 'FAILED'
        message = 'Match confidence too low'
    
    return {
        'address': address,
        'quality': quality,
        'message': message,
        'result': result
    }


# Test quality ratings
quality_tests = [
    "San Francisco, CA",      # Should be HIGH
    "San Fransisco, CA",      # Should be MEDIUM (typo)
    "Sanfrancisco California", # Should be LOW (multiple issues)
    "XYZ City, ZZ",           # Should be FAILED
]

print("Quality Assessment Results:")
print("=" * 80)

for addr in quality_tests:
    assessment = geocode_with_quality(geocoder, addr)
    
    print(f"\nAddress: {addr}")
    print(f"Quality: {assessment['quality']}")
    print(f"Message: {assessment['message']}")
    
    if assessment['result']:
        r = assessment['result']
        print(f"Result: {r['city']}, {r['state']} (confidence: {r['confidence']:.1%})")

## 9. Batch Processing for Scale

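When processing large address files, it helps to group results by quality tier so you can see how much of the file is usable automatically and how much needs review.

Let's process a batch of addresses and generate a summary report: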

In [None]:
def batch_geocode(geocoder: GeocodingFallback, addresses: list) -> dict:
    """
    Batch geocode addresses and return statistics.
    """
    results = {
        'HIGH': [],
        'MEDIUM': [],
        'LOW': [],
        'FAILED': []
    }
    
    for addr in addresses:
        assessment = geocode_with_quality(geocoder, addr)
        results[assessment['quality']].append(assessment)
    
    return results


# Simulate a batch of addresses (mix of quality)
batch_addresses = [
    # Clean addresses
    "New York, NY",
    "Los Angeles, CA",
    "Chicago, IL",
    "Houston, TX",
    "Phoenix, AZ",
    
    # Minor issues (typos, abbreviations)
    "San Fransisco, CA",
    "Seatle, WA",
    "St. Louis, MO",
    "Ft. Worth, TX",
    "NYC",
    
    # More challenging
    "Portland",  # Ambiguous
    "Philly",
    "Vegas",
    
    # Bad data
    "Unknown City, XX",
    "123 Main Street",  # Street address, not city
]

results = batch_geocode(geocoder, batch_addresses)

# Print summary
print("Batch Geocoding Results Summary")
print("=" * 50)
total = len(batch_addresses)

for quality in ['HIGH', 'MEDIUM', 'LOW', 'FAILED']:
    count = len(results[quality])
    pct = count / total * 100
    print(f"{quality:<8}: {count:>3} ({pct:>5.1f}%)")

success_rate = (len(results['HIGH']) + len(results['MEDIUM'])) / total * 100
print(f"\nSuccess Rate (HIGH+MEDIUM): {success_rate:.1f}%")
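
For large files, many rows often repeat the same city string. One possible optimization, sketched below, is to geocode each distinct normalized input only once and reuse the result. The helper name `batch_geocode_unique` and the case/whitespace normalization are assumptions for illustration and are not part of the classes built in this notebook; `geocode_fn` stands for any callable such as `geocoder.geocode`.

```python
from typing import Callable, Dict, List, Optional


def batch_geocode_unique(
    geocode_fn: Callable[[str], Optional[dict]],
    addresses: List[str],
) -> List[Optional[dict]]:
    """Geocode each distinct address once and reuse the result for repeats.

    Hypothetical helper: `geocode_fn` is any callable that maps an address
    string to a result dict or None (for example, `geocoder.geocode`).
    """
    cache: Dict[str, Optional[dict]] = {}
    output = []
    for addr in addresses:
        # Collapse case and whitespace so trivial variants share a cache entry
        key = " ".join(addr.lower().split())
        if key not in cache:
            # Misses (None) are cached too, so bad rows are not retried
            cache[key] = geocode_fn(addr)
        output.append(cache[key])
    return output
```

Results come back in input order, and repeated rows share the same result object, so treat the returned dicts as read-only or copy them before modifying.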

## 10. Production Considerations

When deploying a geocoding fallback system:

### Performance
- **BK-tree**: Lookups prune most of the tree when `max_distance` is small (often approximated as O(log n)); cost approaches a full scan as `max_distance` grows
- **SchemaIndex**: Uses optimized indices per field type
- **Memory**: ~50KB per 1,000 cities with full metadata

### Accuracy Trade-offs
- Higher `max_distance` catches more typos but increases false positives
- State filtering dramatically improves accuracy for ambiguous cities
- ZIP code matching provides strong disambiguation
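
To make the `max_distance` trade-off concrete, the sketch below uses a plain-Python Levenshtein function as a stand-in for the BK-tree's distance metric (it is not the FuzzyRust API) and shows how the candidate set grows as the threshold is relaxed. The short city list and the helper names `levenshtein` and `candidates_within` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def candidates_within(query: str, names: list, max_distance: int) -> list:
    """Return (name, distance) pairs within max_distance, closest first."""
    scored = [(name, levenshtein(query.lower(), name.lower())) for name in names]
    return sorted(
        [(name, d) for name, d in scored if d <= max_distance],
        key=lambda pair: pair[1],
    )


# Illustrative names only; a real run would use the loaded city list
sample_cities = ["Portland", "Cortland", "Portola", "Poland"]

for max_distance in (1, 2, 3):
    matches = candidates_within("Portlnd", sample_cities, max_distance)
    print(max_distance, matches)
```

For the misspelling "Portlnd", a threshold of 1 admits only Portland, a threshold of 2 also admits Cortland, and a threshold of 3 admits Portola and Poland as well. This is why a state or ZIP filter becomes more important as the threshold is loosened.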

### Integration Pattern
```python
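def geocode_address(address):
    # Try the primary API first. `google_geocode` is a placeholder for your
    # own API client wrapper; it is not defined in this notebook.
    try:
        result = google_geocode(address)
    except Exception:
        # Treat outages, timeouts, and rate-limit errors as a miss
        result = None
    if result:
        return result
    
    # Fall back to fuzzy matching against the local city database
    fallback = geocoder.geocode(address)
    if fallback and fallback['confidence'] >= 0.7:
        return {
            'lat': fallback['lat'],
            'lng': fallback['lng'],
            'source': 'fallback',
            'confidence': fallback['confidence']
        }
    
    return None  # Request manual review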
```

### Monitoring
- Track fallback usage rate (if >20%, check API health)
- Log confidence distribution to tune thresholds
- Review LOW confidence matches for pattern detection
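
The monitoring points above could be tracked with a small in-process counter like the sketch below. `FallbackMonitor`, its method names, the source labels, and the 0.1-wide confidence buckets are hypothetical choices for illustration; a production system would more likely send these values to its existing metrics stack.

```python
from collections import Counter
from typing import Optional


class FallbackMonitor:
    """Hypothetical in-memory tracker for fallback usage and confidence."""

    def __init__(self, alert_rate: float = 0.20):
        self.alert_rate = alert_rate         # 20% threshold from the list above
        self.sources = Counter()             # e.g. 'primary', 'fallback', 'none'
        self.confidence_buckets = Counter()  # e.g. '0.7-0.8' -> count

    def record(self, source: str, confidence: Optional[float] = None) -> None:
        """Record one geocoding outcome; confidence applies to fallback hits."""
        self.sources[source] += 1
        if confidence is not None:
            # Clamp so a confidence of exactly 1.0 lands in the top bucket
            low = min(int(confidence * 10), 9) / 10
            self.confidence_buckets[f"{low:.1f}-{low + 0.1:.1f}"] += 1

    def fallback_rate(self) -> float:
        total = sum(self.sources.values())
        return self.sources['fallback'] / total if total else 0.0

    def should_check_api(self) -> bool:
        return self.fallback_rate() > self.alert_rate


monitor = FallbackMonitor()
for _ in range(7):
    monitor.record('primary')
monitor.record('fallback', 0.93)
monitor.record('fallback', 0.74)
monitor.record('fallback', 0.55)

print(f"Fallback rate: {monitor.fallback_rate():.0%}")
print(f"Check API health: {monitor.should_check_api()}")
print(dict(monitor.confidence_buckets))
```

Here 3 of 10 requests used the fallback, so the rate exceeds the 20% threshold and the monitor flags the primary API for a health check. The bucket counts give a rough confidence distribution to compare against the HIGH/MEDIUM/LOW cutoffs.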

## Summary

In this guide, we built a complete address geocoding fallback system:

1. **City normalization**: Handle abbreviations (St., Ft.), nicknames (NYC, LA), and typos
2. **BK-tree matching**: Efficient edit-distance queries for typo correction
3. **SchemaIndex multi-field**: Combine city, state, and ZIP for better accuracy
4. **Confidence scoring**: Categorize results by reliability
5. **Batch processing**: Handle large address files efficiently

### Key Takeaways

- **BK-trees** are ideal for short strings like city names (fast, memory-efficient)
- **State filtering** is crucial for disambiguating common city names
- **Normalization** before matching dramatically improves match rates
- **Confidence thresholds** help balance automation vs. accuracy
- **Multi-field matching** with SchemaIndex handles partial/ambiguous queries

### When to Use This Approach

- API rate limits or cost constraints
- Offline/air-gapped environments
- High-volume batch processing
- As a fallback when primary geocoding fails