# Imprecise Country Matching with `pycountry`

Each field of a structured address has its own semantics, so tools backed by a database for one specific field can help match that address component.

Given a valid ISO 3166-1 country code (two or three letters) or a long-form country name, [pycountry](https://pypi.org/project/pycountry/) ([GitHub](https://github.com/pycountry/pycountry)), a package on PyPI, can retrieve the corresponding country record. Two differently formatted country strings can then be compared by resolving each to its record. In this notebook we try out `pycountry` for country matching, using its database of valid names and its pattern matching capability.

In [None]:
import logging
import random
import re
import sys
from typing import Literal

import numpy as np
import pandas as pd
import pycountry
from postal.parser import parse_address

from utils import (
    augment_gold_labels,
    format_dataset,
    gold_label_report,
    to_dict,
)

#### Pin Random Seeds for Reproducibility

In [None]:
RANDOM_SEED = 31337

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

#### Setup Basic Logging

In [None]:
logging.basicConfig(stream=sys.stderr, level=logging.ERROR)

logger = logging.getLogger(__name__)

#### Configure Pandas to Show More Rows

In [None]:
pd.set_option("display.max_rows", 40)
pd.set_option("display.max_columns", None)

## Matching Country Names with `pycountry`

In [None]:
def match_country_names(country1: str, country2: str) -> Literal[0, 1]:
    """match_country_names - compare and match varying country formats using pycountry"""

    # Remove any punctuation from the country
    def remove_punctuation(country: str) -> str:
        # Use re.sub to replace all punctuation characters with an empty string
        return re.sub(r"[^\w\s]", "", country)

    def multi_lookup(**kwargs):
        """Try each key until we retrieve a result"""
        for arg, value in kwargs.items():
            result = pycountry.countries.get(**{arg: value})
            if result:
                return result

    def get_args(country: str):
        """Compose pycountry.countries.get arguments dict based on length of country string"""
        args = {}
        if country and len(country) == 2:
            args["alpha_2"] = country
        elif country and len(country) == 3:
            args["alpha_3"] = country
        elif country:
            args["name"] = country
            args["common_name"] = country
            args["official_name"] = country
        return args

    try:
        pycountry1 = multi_lookup(**get_args(remove_punctuation(country1)))
        pycountry2 = multi_lookup(**get_args(remove_punctuation(country2)))

        return 1 if pycountry1.name == pycountry2.name else 0
    except AttributeError:
        return 0

In [None]:
match_country_names("sg", "singapore")

In [None]:
match_country_names("usa", "united states")

In [None]:
# Only matches because remove_punctuation() strips the periods before the lookup
match_country_names("U.S.A.", "United States of America")
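
The punctuation step matters because `get_args` picks the lookup field from the length of the string. The following is a minimal standalone sketch of that dispatch; it does not call `pycountry`, and the helper name `choose_lookup_fields` is hypothetical, introduced only for illustration.

```python
import re


def remove_punctuation(country: str) -> str:
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", country)


def choose_lookup_fields(country: str) -> list[str]:
    """Hypothetical helper: list the fields a string would be looked up under."""
    if len(country) == 2:
        return ["alpha_2"]
    if len(country) == 3:
        return ["alpha_3"]
    if country:
        return ["name", "common_name", "official_name"]
    return []


raw = "U.S.A."
cleaned = remove_punctuation(raw)

# With the periods, the six-character string is treated as a country name
print(len(raw), choose_lookup_fields(raw))

# Without them, the three-character string is treated as an alpha-3 code
print(cleaned, choose_lookup_fields(cleaned))
```

Without the cleanup, `"U.S.A."` would be searched among the name fields, where no such entry is expected, rather than as the alpha-3 code `USA`.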

In [None]:
match_country_names("USA", "MEX")

In [None]:
match_country_names("United States", "United Mexican States")

### Country Parsing in Structured Matching

Let's use our new `match_country_names(country1: str, country2: str) -> Literal[0, 1]` function to improve our original structured matcher. The matcher can then accept varying country formats and still match, which makes it more robust.

In order to make this work we have to refactor our code to create matching functions for each field. Note that we are leaving out matching states, as they aren't required if the road name, number, unit and postal code match.

In [None]:
def parse_match_address_country(address1: str, address2: str) -> Literal[0, 1]:
    """parse_match_address_country implements address matching like parse_match_address() but with pycountry country matching"""
    address1 = to_dict(parse_address(address1))
    address2 = to_dict(parse_address(address2))

    def match_road(address1: dict, address2: dict) -> Literal[0, 1]:
        """match_road - literal road matching, negative if either lacks a road"""
        if ("road" in address1) and ("road" in address2):
            if address1["road"] == address2["road"]:
                logger.debug("road match")
                return 1
            else:
                logger.debug("road mismatch")
                return 0
        logger.debug("road mismatch")
        return 0

    def match_house_number(address1: dict, address2: dict) -> Literal[0, 1]:
        """match_house_number - literal house number matching, negative if either lacks a house_number"""
        if ("house_number" in address1) and ("house_number" in address2):
            if address1["house_number"] == address2["house_number"]:
                logger.debug("house_number match")
                return 1
            else:
                logger.debug("house_number mismatch")
                return 0
        logger.debug("house_number mismatch")
        return 0

    def match_unit(address1: dict, address2: dict) -> Literal[0, 1]:
        """match_unit - literal unit matching; a unit missing from both addresses is a match"""
        if ("unit" in address1) and ("unit" in address2):
            if address1["unit"] == address2["unit"]:
                logger.debug("unit match")
                return 1
            else:
                logger.debug("unit mismatch")
                return 0
        if ("unit" in address1) or ("unit" in address2):
            # Only one address has a unit
            logger.debug("unit mismatch")
            return 0
        # Neither address has a unit, which is a default match
        logger.debug("unit match")
        return 1

    def match_postcode(address1: dict, address2: dict) -> Literal[0, 1]:
        """match_postcode - literal matching, negative if either lacks a postal code"""
        if ("postcode" in address1) and ("postcode" in address2):
            if address1["postcode"] == address2["postcode"]:
                logger.debug("postcode match")
                return 1
            else:
                logger.debug("postcode mismatch")
                return 0
        logger.debug("postcode mismatch")
        return 0

    def match_country(address1: dict, address2: dict) -> Literal[0, 1]:
        """match_country - semantic country matching with pycountry via match_country_names(country1, country2)"""
        if ("country" in address1) and ("country" in address2):
            if match_country_names(address1["country"], address2["country"]):
                logger.debug("country match")
                return 1
            else:
                logger.debug("country mismatch")
                return 0
        # If one or both addresses lack a country, default to a match
        logger.debug("country match")
        return 1

    # Combine the above to get a complete address matcher
    if (
        match_road(address1, address2)
        and match_house_number(address1, address2)
        and match_unit(address1, address2)
        and match_postcode(address1, address2)
        # The only non-exact matcher; it defaults to 1 (match) when a country is missing
        and match_country(address1, address2)
    ):
        logger.debug("overall match")
        return 1
    else:
        logger.debug("overall mismatch")
        return 0
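
The road, house number, and postcode matchers above share one shape: a literal comparison that fails when either side lacks the field. As a sketch only, a hypothetical `match_exact_field` helper (not part of this notebook or of `utils`) could express that rule once. It operates on plain dicts, which is what `to_dict(parse_address(...))` is assumed to return; the dict values below are hand-written for illustration.

```python
from typing import Literal


def match_exact_field(field: str, address1: dict, address2: dict) -> Literal[0, 1]:
    """Hypothetical helper: literal match on one field, negative if either side lacks it."""
    if field in address1 and field in address2:
        return 1 if address1[field] == address2[field] else 0
    return 0


# Hand-written dicts standing in for parsed addresses (illustrative values only)
a = {"house_number": "100", "road": "king st w", "postcode": "m5x 1a9"}
b = {"house_number": "100", "road": "king st w", "postcode": "m5x 1a9", "country": "canada"}

# The three strict fields agree
print([match_exact_field(f, a, b) for f in ("road", "house_number", "postcode")])

# Applying the same strict rule to country would reject this pair
print(match_exact_field("country", a, b))
```

The last line shows why the country matcher uses a different default: under the strict rule, an address pair where only one side names its country could never match.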

In [None]:
parse_match_address_country(
    "100 Roxas Blvd, Ermita, Manila, 1000 Metro Manila, PH",
    "100 Roxas Blvd, Ermita, Manila, 1000 Metro Manila, Republic of the Philippines"
)

In [None]:
# Defaults to match if no countries are provided
parse_match_address_country(
    "100 King St W, Toronto, ON M5X 1A9",
    "100 King St W, Toronto, ON M5X 1A9",
)

In [None]:
# Defaults to match if only one address has country
parse_match_address_country(
    "100 King St W, Toronto, ON M5X 1A9",
    "100 King St W, Toronto, ON M5X 1A9, Canada",
)

In [None]:
# Verify mismatch
parse_match_address_country(
    "Bosque de Chapultepec I Secc, Miguel Hidalgo, 11850 Ciudad de México, CDMX, Mexico",
    "Bosque de Chapultepec I Secc, Miguel Hidalgo, 11850 Ciudad de México, CDMX, USA"
)

## Gold Label Validation

We need to evaluate this new method against the gold-labeled data defined previously in [Address Data Augmentation.ipynb](Address%20Data%20Augmentation.ipynb).

In [None]:
gold_df = pd.read_csv("data/gold.csv")

In [None]:
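def strict_parse_match(row: pd.Series) -> Literal[0, 1]:
    """strict_parse_match - strict address matching"""
    # NOTE: parse_match_address() is not defined or imported in this notebook;
    # define or import it before calling this function
    return parse_match_address(row["Address1"], row["Address2"])


def parse_match_country(row: pd.Series) -> Literal[0, 1]:
    """parse_match_country - address matching with pycountry country matching"""
    return parse_match_address_country(row["Address1"], row["Address2"])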

In [None]:
raw_df, grouped_df = gold_label_report(gold_df, [parse_match_country])

In [None]:
grouped_df

In [None]:
true_df = raw_df[raw_df["parse_match_country_correct"]]
print(f"Total accurate matches for parse_match_country: {len(true_df):,}")

true_df.sort_values(by="Description").reset_index(drop=True)

In [None]:
false_df = raw_df[raw_df["parse_match_country_correct"] == False]
print(f"Total mismatches for parse_match_country: {len(false_df):,}")

false_df.sort_values(by="Description").reset_index(drop=True)
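
The internals of `gold_label_report` are not shown in this notebook. The `parse_match_country_correct` column filtered above presumably records whether the matcher's output agrees with the gold label for each row. Below is a minimal sketch of that idea in plain Python, under stated assumptions: the `Label` key, the `score` function, and the stand-in `exact_string_match` matcher are all hypothetical, and the rows are toy examples reusing address pairs from this notebook rather than the contents of `data/gold.csv`.

```python
from typing import Callable

# Toy rows; the "Label" key and its values are assumptions for illustration
rows = [
    {
        "Address1": "100 King St W, Toronto, ON M5X 1A9",
        "Address2": "100 King St W, Toronto, ON M5X 1A9",
        "Label": 1,
    },
    {
        "Address1": "100 Roxas Blvd, Ermita, Manila, 1000 Metro Manila, PH",
        "Address2": "100 Roxas Blvd, Ermita, Manila, 1000 Metro Manila, Republic of the Philippines",
        "Label": 1,
    },
]


def exact_string_match(row: dict) -> int:
    """Stand-in matcher for illustration: 1 only if both strings are identical."""
    return 1 if row["Address1"] == row["Address2"] else 0


def score(rows: list[dict], matcher: Callable[[dict], int]) -> list[dict]:
    """Return copies of the rows with a '<matcher name>_correct' flag attached."""
    key = f"{matcher.__name__}_correct"
    return [{**row, key: matcher(row) == row["Label"]} for row in rows]


scored = score(rows, exact_string_match)
accuracy = sum(r["exact_string_match_correct"] for r in scored) / len(scored)
print(f"Accuracy: {accuracy:.0%}")
```

The naive string matcher gets the second pair wrong because the two addresses differ only in how the country is written, which is the case `parse_match_address_country` is meant to handle.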