# Address Data Augmentation with Programmatic Data Labeling

So, you have just arrived from [Address Matching Deep Dive.ipynb](Address%20Matching%20Deep%20Dive.ipynb) and need to generate training data for supervised learning approaches to address matching. Let's get started syngheszing some data!

In this notebook we will start with less than 100 labeled records with descriptions and use the OpenAI GPT4o API to generate records defined by the same semantic as the description for each record. This will give us enough data to train sentence transformer embeddings and deep learning models.

If you started here, move over to the [Address Matching Deep Dive.ipynb](Address%20Matching%20Deep%20Dive.ipynb) notebook first, then hop over here when directed :)

In [1]:
import logging
import sys
import warnings
from typing import List, Literal, Tuple, Union

import pandas as pd
from langchain.chains import LLMChain
from langchain.globals import set_llm_cache
from langchain_core._api.deprecation import LangChainDeprecationWarning
from langchain_community.cache import InMemoryCache
from langchain_core.caches import InMemoryCache
from langchain_core.outputs import Generation
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai.chat_models import ChatOpenAI
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_openai.chat_models import ChatOpenAI
from openai import APIConnectionError, RateLimitError

from utils import (
    augment_gold_labels,
    compute_sbert_metrics,
    compute_classifier_metrics,
    gold_label_report,
    preprocess_logits_for_metrics,
    to_dict,
    parse_match_address,
)

In [2]:
logging.basicConfig(stream=sys.stderr, level=logging.ERROR)

logger = logging.getLogger(__name__)

#### Squelch All `warnings`

[Langchain](https://python.langchain.com/v0.2/docs/introduction/) produces many deprecation warnings as its API is constantly improving. Let's squash them all!

In [3]:
# Suppress all warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=LangChainDeprecationWarning)

In [4]:
pd.set_option("display.max_rows", 40)

## Create a Dataset of Labeled Address Pairs for Training and Evaluation of Address Matchers

Before improving our precise structured address matcher or trying other approaches, let's create a dataset that enables a more rigorous test of address matching methods. I created two sets of pairs of addresses. Each pair is a `Tuple`, including a description and pair of addresses it describes.

1. `matched_address_pairs` - addresses that look different but that represent the same location. These are matched pairs.
2. `mistmatched_address_pairs` - addresses that look similar but represent different locations. The are mismatched pairs.

These are combined into a global `address_pairs` below.

In [5]:
matched_address_pairs: List[Tuple[str, str, str]] = [
    (
        "Different directional prefix formats for same address should match",
        "2024 NW 5th Ave, Miami, FL 33127",
        "2024 Northwest 5th Avenue, Miami, Florida 33127",
    ),
    (
        "Abbreviated street type for same address should match",
        "10200 NE 12th St, Bellevue, WA 98003",
        "10200 NE 12th Street, Bellevue, WA 98003",
    ),
    (
        "Common misspellings for same address should match",
        "1600 Pennsylvna Ave NW, Washington, DC 20500",
        "1600 Pennsylvania Avenue NW, Washington, DC 20500",
    ),
    (
        "Different directional prefix formats for same address should match",
        "550 S Hill St, Los Angeles, CA",
        "550 South Hill Street, Los Angeles, California",
    ),
    (
        "Incomplete address vs full address may match",
        "1020 SW 2nd Ave, Portland",
        "1020 SW 2nd Ave, Portland, OR 97204",
    ),
    (
        "Numerical variations for same address should match",
        "Third Ave, New York, NY",
        "3rd Avenue, New York, New York")
    ,
    (
        "Variant format of same address should match",
        "350 Fifth Avenue, New York, NY 10118",
        "Empire State Bldg, 350 5th Ave, NY, NY 10118",
    ),
    (
        "Variant format of same address should match",
        "Çırağan Caddesi No: 32, 34349 Beşiktaş, Istanbul, Turkey",
        "Ciragan Palace Hotel, Ciragan Street 32, Besiktas, Istanbul, TR",
    ),
    (
        "Different character sets for same address should match",
        "北京市朝阳区建国路88号",
        "Běijīng Shì Cháoyáng Qū Jiànguó Lù 88 Hào",
    ),
    (
        "Variant formats of same address should match",
        "上海市黄浦区南京东路318号",
        "上海黄浦南京东路318号"
    ),
    (
        "Variant formats of same address should match",
        "Shànghǎi Shì Huángpǔ Qū Nánjīng Dōng Lù 318 Hào",
        "Shànghǎi Huángpǔ Nánjīng Dōng Lù 318 Hào",
    ),
    (
        "Formal and localized format of same address should match",
        "B-14, Connaught Place, New Delhi, Delhi 110001, India",
        "B-14, CP, ND, DL 110001",
    ),
    (
        "Different character sets for same address should match",
        "16, MG Road, Bangalore, Karnataka 560001, India",
        "16, एमजी रोड, बैंगलोर, कर्नाटक 560001",
    ),
    (
        "Missing state but has postal code and country for same address should match",
        "Pariser Platz 2, 10117 Berlin, Germany",
        "Pariser Platz 2, 10117 Berlin, Berlin, Germany",
    ),
    (
        "Missing state but has postal code and country for same address should match",
        "Marienplatz 1, 80331 Munich, Germany",
        "Marienplatz 1, 80331 Munich, Bavaria, Germany"
    ),
    (
        "Abbreviated vs. full street names for same address should match",
        "123 Main St, Springfield, IL",
        "123 Main Street, Springfield, IL",
    ),
    (
        "Different languages for same address should match",
        "北京市东城区东长安街16号",
        "16 Dongchang'an St, Dongcheng, Beijing, China",
    ),
    (
        "Same address with and without country should match",
        "1600 Amphitheatre Parkway, Mountain View, CA 94043, USA",
        "1600 Amphitheatre Parkway, Mountain View, CA 94043",
    ),
    (
        "Same address with and without country should match",
        "3413 Sean Way, Lawrenceville, GA 30044, U.S.A.",
        "3413 Sean Way, Lawrenceville, Georgia, 30044",
    ),
    (
        "Different levels of detail for the same address may match",
        "221B Baker Street, London, NW1 6XE, UK",
        "221B Baker St, Marylebone, London NW1 6XE",
    ),
    (
        "Same address with and without district / neighborhood names should match",
        "1600 Amphitheatre Parkway, Mountain View, CA 94043, USA",
        "1600 Amphitheatre Parkway, Shoreline, Mountain View, CA 94043, USA",
    ),
    (
        "Including and excluding building names for same address should match",
        "The Empire State Building, 350 5th Ave, New York, NY 10118",
        "350 5th Ave, New York, NY 10118",
    ),
    (
        "Floor bumbers included or excluded for same address should match",
        "350 5th Ave, 86th Floor, New York, NY 10118",
        "350 5th Ave, New York, NY 10118",
    ),
    (
        "Same address incorporates business name or not should match",
        "Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043",
        "1600 Amphitheatre Parkway, Mountain View, CA 94043",
    ),
    (
        "Intersection vs addresss for same location should match",
        "1600 Amphitheatre Parkway at Charleston Road, Mountain View, CA 94043",
        "1600 Amphitheatre Parkway, Mountain View, CA 94043",
    ),
    (
        "Local vs. international formatting for same address should match",
        "221B Baker Street, London, NW1 6XE, UK",
        "221B Baker Street, Marylebone, London, NW1 6XE, United Kingdom",
    ),
    (
        "Addition of parenthetical details for same address should match",
        "Building 4 (East Wing), 123 Tech Park, Silicon Valley, CA 94301",
        "Building 4, 123 Tech Park, Silicon Valley, CA 94301",
    ),
    (
        "Synonyms for street types for same address should match",
        "456 Elm St, Springfield, IL 62704",
        "456 Elm Street, Springfield, IL 62704",
    ),
    (
        "Different language versions of same address should match",
        "16 Rue de la Paix, 75002 Paris, France",
        "16 Peace Street, 75002 Paris, France",
    ),
    (
        "Different terms for unit number for same address should match",
        "500 Fifth Avenue, Apt. 20, New York, NY 10110",
        "500 Fifth Avenue, Suite 20, New York, NY 10110",
    ),
    (
        "Including a business name or not, in same address should match",
        "123 Main St, Springfield, IL",
        "Company ABC, 123 Main St, Springfield, IL",
    ),
    (
        "Typographical errors in street name of same address should match",
        "1600 Amphitheatre Parkway, Mountain View, CA",
        "1600 Amptheatre Parkway, Mountain View, CA",
    ),
    (
        "Typographical errors in same address with country should match",
        "Calle Mayor, 10, 28013 Madrid, España",
        "Calle Mayor, 10, 28013 Madird, España",
    ),
    (
        "Typographical errors in city of same address should match",
        "16 Rue de la Paix, 75002 Paris, France",
        "16 Rue de la Paix, 75002 Pariss, France",
    ),
    (
        "Typographical errors in city of same address should match",
        "Alexanderplatz 1, 10178 Berlin, Deutschland",
        "Alexanderplatz 1, 10178 Berin, Deutschland",
    ),
    (
        "Common typographical errors in same address should match",
        "北京市东城区东长安街1号, 中国",
        "北京市东城区东长安街1号, 中囯",
    ),
    (
        "Numeric or written street number for same address should match",
        "123 4th St, Springfield, IL",
        "123 Fourth St, Springfield, IL",
    ),
    (
        "Punctuation or not in abbreviations for same address should match",
        "10350 NE 12th St, Bellevue, WA 98003",
        "10350 N.E. 12th St., Bellevue, WA 98003",
    ),
    (
        "Normal vs formal country names for same address should match",
        "456 Coastal Lane, Benaulim, Goa, 403716, India",
        "456 Coastal Lane, Benaulim, Goa, 403716, Republic of India",
    ),
    (
        "Normal vs abbreviated country name for same address should match",
        "456 Coastal Lane, Benaulim, Goa, 403716, India",
        "456 Coastal Lane, Benaulim, Goa, 403716, IN",
    ),
    (
        "Missing country in one record can match",
        "3413 Sean Way, Lawrenceville, GA 30044",
        "3413 Sean Way, Lawrenceville, GA 30044, USA",
    ),
    (
        "Addresses that match are often missing countries",
        "Rua de Santa Catarina, 432, 4000-446 Porto, Portugal",
        "Rua de Santa Catarina, 432, 4000-446 Porto",
    ),
    (
        "A complete address without a country that otherwise matches an address with a country should match",
        "ul. Krakowska 1500-956 Warszawa, Poland",
        "ul. Marszałkowska 1500-956 Warszawa",
    ),
]

In [6]:
mismatched_address_pairs: List[Tuple[str, str, str]] = [
    (
        "Different street numbers means different address",
        "101 Oak Lane, Marietta, GA 30008",
        "102 Oak Lane, Marietta, GA 30008",
    ),
    (
        "Different street names means different address",
        "101 Market Square, Seattle, WA 98039",
        "101 Davis Place, Seattle, WA 98039",
    ),
    (
        "Different street name endings means different address",
        "100 Oak Lane, Atlanta, GA 30306",
        "100 Oak Place, Atlanta, GA 30306",
    ),
    (
        "Different cities means different address",
        "2754 Ralph McGill Blvd, Atlanta, GA",
        "2754 Ralph McGill Blvd, Macon, GA",
    ),
    (
        "Different states means different address",
        "361 Oakhurst Ave., Rome, GA 30149",
        "361 Oakhurst Ave., Rome, NY, 13308",
    ),
    (
        "Different postal codes means different address",
        "76 Providence St, Providence, RI, 02860",
        "76 Providence St, Providence, RI, 02861",
    ),
    (
        "Similar cities in different states, postal codes or countries means different address",
        "100 Main Street, Springfield, IL 62701",
        "100 Main Street, Springfield, MA 01103",
    ),
    (
        "Similar street names with different directions means different address",
        "200 1st Ave, Seattle, WA 98109",
        "200 1st Ave N, Seattle, WA 98109",
    ),
    (
        "Adjacent or nearby building numbers means different address",
        "4800 Oak Street, Kansas City, MO 64112",
        "4800 W Oak Street, Kansas City, MO 64112",
    ),
    (
        "Similar international locations in different countries means different address",
        "33 Queen Street, Auckland 1010, New Zealand",
        "33 Queen Street, Brisbane QLD 4000, Australia",
    ),
    (
        "Close numerical variants are different addresses",
        "75 West 50th Street, New York, NY 10112",
        "50 West 75th Street, New York, NY 10023",
    ),
    (
        "Similar road names can be different addresses",
        "北京市朝阳区朝阳门外大街6号",
        "北京市朝阳区朝阳门内大街6号"
    ),
    (
        "Similar road names can be different addresses",
        "Běijīng Shì Cháoyáng Qū Cháoyángmén Wài Dàjiē 6 Hào",
        "Běijīng Shì Cháoyáng Qū Cháoyángmén Nèi Dàjiē 6 Hào",
    ),
    (
        "Similar building names can be different addresses",
        "上海市徐汇区中山西路200号",
        "上海市长宁区中山西路200号",
    ),
    (
        "Similar but different building names means different address",
        "Shànghǎi Shì Xúhuì Qū Zhōngshān Xī Lù 200 Hào",
        "Shànghǎi Shì Chángníng Qū Zhōngshān Xī Lù 200 Hào",
    ),
    (
        "Different unit numbers means different address",
        "27 Peachtree St, Apt 101, Atlanta, GA 30307",
        "27 Peachtree St, Apt 1213, Atlanta, GA 30307",
    ),
    (
        "Missing unit number in match means different address",
        "27 Peachtree St., Apt 101, Atlanta, GA 30308",
        "27 Peachtree St., Atlanta, GA 30308",
    ),
    (
        "Missing street suffix can mean different address",
        "1020 SW 2nd, Portland, OR 97204",
        "1020 SW 2nd Ave, Portland, OR 97204",
    ),
    (
        "Missing postal code can mean different address",
        "Bouillon Racine: 3, rue Racine, 75006 Paris",
        "Bouillon Racine: 3, rue Racine, Paris",
    ),
    (
        "Different postal codes means different address",
        "1 Infinite Loop, Cupertino, CA 95014",
        "1 Infinite Loop, Cupertino, CA 95015",
    ),
    (
        "Different units in a building means different address",
        "500 Fifth Avenue, Apt. 2A, New York, NY 10110",
        "500 Fifth Avenue, Apt. 2-B, New York, NY 10110",
    ),
    (
        "Street type variations means different address",
        "456 Elm St, Springfield, IL 62704",
        "456 Elm Rd, Springfield, IL 62704",
    ),
    (
        "Different street suffixes means different address",
        "123 Main St, Springfield, IL",
        "123 Main Ave, Springfield, IL",
    ),
    (
        "Different states means different address",
        "Alexanderstraße 7, 10178 Berlin, Germany",
        "Alexanderstraße 7, 20099 Hamburg, Germany",
    ),
    (
        "Different states means different address",
        "200 George St, Sydney, NSW 2000, Australia",
        "200 George St, Melbourne, VIC 3000, Australia",
    ),
    (
        "Different states means different address",
        "100 King St W, Toronto, ON M5X 1A9, Canada",
        "100 King St W, Vancouver, BC V6B 1H8, Canada",
    ),
    (
        "Different street numbers means different address",
        "Unter den Linden 4, 10117 Berlin, Germany",
        "Unter den Linden 5, 10117 Berlin, Germany",
    ),
    (
        "Different street numbers means different address",
        "Avenida Paulista 1000, Bela Vista, São Paulo - SP, 01310-100",
        "Avenida Paulista 200, Bela Vista, São Paulo - SP, 01310-100",
    ),
    (
        "Different countries means different address",
        "123 Main Street, Vancouver, BC V5K 0A1, Canada",
        "123 Main Street, Vancouver, WA 98660, USA",
    ),
    # Some widely diverging examples
    (
        "Completely different addresses that don't match",
        "110 Sejong-daero, Jung-gu, Seoul, South Korea",
        "Avenue Colonel Mondjiba 372, Kinshasa, Gombe, Democratic Republic of the Congo",
    ),
    (
        "Different addresses in the same country that don't match",
        "1234 Manor Plaza, Pacifica, CA 94044",
        "1234 Bly Manor, Pacific Heights, WA 98003",
    ),
    (
        "Different addresses in the same country that don't match",
        "350 5th Ave, New York, NY 10118",
        "1350 El Prado, San Diego, CA 92101",
    ),
    (
        "Completely different addresses that don't match",
        "Rue de la Loi 175, 1040 Brussels",
        "1 Macquarie Street, Sydney, NSW 2000",
    ),
    (
        "Different street names means different address",
        "market Square",
        "davis Place",
    ),
    (
        "Similar but different street numbers",
        "10101 Tensor St.",
        "11010 Tensor St.",
    ),
    (
        "Different address field positions of a pair of the same numbers in unmatched addresses",
        "75 50th St., Brooklyn, NY 11232",
        "50 75th St., Brooklyn, NY 11209",
    ),
    (
        "Different address field positions of a pair of the same numbers in unmatched addresses",
        "74 49th Ave, Long Island City, NY 11232",
        "49 74th Ave, Long Island City, NY 11232",
    ),
    (
        "Different street names with the same house number are different addresses",
        "555 Oak Court, Seattle WA 98039",
        "555 Maple Lane, Seattle, WA 98039",
    ),
    (
        "Different street names with the same house number are different addresses",
        "101 Post Street",
        "101 Oak Street",
    ),
    (
        "Different street names with the same house number are different addresses",
        "ul. Krakowska 1500-956 Warszawa",
        "ul. Marszałkowska 1500-956 Warszawa",
    ),
]

In [7]:
address_pairs = matched_address_pairs + mismatched_address_pairs

print(f"Matched label address pairs: {len(matched_address_pairs):,}")
print(f"Mismatched label address pairs: {len(mismatched_address_pairs):,}")
print(f"Total label address pairs: {len(address_pairs):,}")

Matched label address pairs: 43
Mismatched label address pairs: 40
Total label address pairs: 83


### Create a `pandas.DataFrame` of Hand Labeled Records

We create separate `pd.DataFrames`, `match_df` and `mismatch_df`, to set their labels as 1 or 0. Then we combine them into `combined_df`.

In [8]:
match_df = pd.DataFrame(matched_address_pairs, columns=["Description", "Address1", "Address2"])
match_df["Label"] = 1

match_df.head(40)

Unnamed: 0,Description,Address1,Address2,Label
0,Different directional prefix formats for same ...,"2024 NW 5th Ave, Miami, FL 33127","2024 Northwest 5th Avenue, Miami, Florida 33127",1
1,Abbreviated street type for same address shoul...,"10200 NE 12th St, Bellevue, WA 98003","10200 NE 12th Street, Bellevue, WA 98003",1
2,Common misspellings for same address should match,"1600 Pennsylvna Ave NW, Washington, DC 20500","1600 Pennsylvania Avenue NW, Washington, DC 20500",1
3,Different directional prefix formats for same ...,"550 S Hill St, Los Angeles, CA","550 South Hill Street, Los Angeles, California",1
4,Incomplete address vs full address may match,"1020 SW 2nd Ave, Portland","1020 SW 2nd Ave, Portland, OR 97204",1
5,Numerical variations for same address should m...,"Third Ave, New York, NY","3rd Avenue, New York, New York",1
6,Variant format of same address should match,"350 Fifth Avenue, New York, NY 10118","Empire State Bldg, 350 5th Ave, NY, NY 10118",1
7,Variant format of same address should match,"Çırağan Caddesi No: 32, 34349 Beşiktaş, Istanb...","Ciragan Palace Hotel, Ciragan Street 32, Besik...",1
8,Different character sets for same address shou...,北京市朝阳区建国路88号,Běijīng Shì Cháoyáng Qū Jiànguó Lù 88 Hào,1
9,Variant formats of same address should match,上海市黄浦区南京东路318号,上海黄浦南京东路318号,1


In [9]:
mismatch_df = pd.DataFrame(mismatched_address_pairs, columns=["Description", "Address1", "Address2"])
mismatch_df["Label"] = 0

mismatch_df.head(40)

Unnamed: 0,Description,Address1,Address2,Label
0,Different street numbers means different address,"101 Oak Lane, Marietta, GA 30008","102 Oak Lane, Marietta, GA 30008",0
1,Different street names means different address,"101 Market Square, Seattle, WA 98039","101 Davis Place, Seattle, WA 98039",0
2,Different street name endings means different ...,"100 Oak Lane, Atlanta, GA 30306","100 Oak Place, Atlanta, GA 30306",0
3,Different cities means different address,"2754 Ralph McGill Blvd, Atlanta, GA","2754 Ralph McGill Blvd, Macon, GA",0
4,Different states means different address,"361 Oakhurst Ave., Rome, GA 30149","361 Oakhurst Ave., Rome, NY, 13308",0
5,Different postal codes means different address,"76 Providence St, Providence, RI, 02860","76 Providence St, Providence, RI, 02861",0
6,"Similar cities in different states, postal cod...","100 Main Street, Springfield, IL 62701","100 Main Street, Springfield, MA 01103",0
7,Similar street names with different directions...,"200 1st Ave, Seattle, WA 98109","200 1st Ave N, Seattle, WA 98109",0
8,Adjacent or nearby building numbers means diff...,"4800 Oak Street, Kansas City, MO 64112","4800 W Oak Street, Kansas City, MO 64112",0
9,Similar international locations in different c...,"33 Queen Street, Auckland 1010, New Zealand","33 Queen Street, Brisbane QLD 4000, Australia",0


### Establish a Gold Labeled Dataset

These are records we have hand-labeled and will use to score our matching models and algorithms.

In [10]:
gold_df = pd.concat([match_df, mismatch_df], ignore_index=True)

gold_df.head(40)

Unnamed: 0,Description,Address1,Address2,Label
0,Different directional prefix formats for same ...,"2024 NW 5th Ave, Miami, FL 33127","2024 Northwest 5th Avenue, Miami, Florida 33127",1
1,Abbreviated street type for same address shoul...,"10200 NE 12th St, Bellevue, WA 98003","10200 NE 12th Street, Bellevue, WA 98003",1
2,Common misspellings for same address should match,"1600 Pennsylvna Ave NW, Washington, DC 20500","1600 Pennsylvania Avenue NW, Washington, DC 20500",1
3,Different directional prefix formats for same ...,"550 S Hill St, Los Angeles, CA","550 South Hill Street, Los Angeles, California",1
4,Incomplete address vs full address may match,"1020 SW 2nd Ave, Portland","1020 SW 2nd Ave, Portland, OR 97204",1
5,Numerical variations for same address should m...,"Third Ave, New York, NY","3rd Avenue, New York, New York",1
6,Variant format of same address should match,"350 Fifth Avenue, New York, NY 10118","Empire State Bldg, 350 5th Ave, NY, NY 10118",1
7,Variant format of same address should match,"Çırağan Caddesi No: 32, 34349 Beşiktaş, Istanb...","Ciragan Palace Hotel, Ciragan Street 32, Besik...",1
8,Different character sets for same address shou...,北京市朝阳区建国路88号,Běijīng Shì Cháoyáng Qū Jiànguó Lù 88 Hào,1
9,Variant formats of same address should match,上海市黄浦区南京东路318号,上海黄浦南京东路318号,1


### Save our Gold Labels

In [11]:
gold_df.to_csv("data/gold.csv", header=True)

## Data Augmentation with GPT4o - Multiplying Training Data

We only have 64 training records, but they aren't just random examples. They cover a range of corner cases that should give a model trained via supervised learning clues about the semantics of addresses.

### Human in the Loop Fine-Tuning

We may have enough diversity of examples to generate sufficient training data to cover most addresses (at least North American ones). To improve performance, we can create new hand-labeled examples that fix the errors in `match_df` and `mismatch_df` and re-run the data augmentation pipeline to then re-train and evaluate our matching models. **This gives product managers of AI products a means of product managing their models' predictions.**

### Not Enough Data: Data Augmentation to the Rescue!

Hand labeling is a slow way to create labeled datasets. We need a lot more data than we have, and I can't spare the time to label thousands of records. We could use "mechanical turks" to do this work, but instead we're going to use a data augmentation strategy to use Large Language Models (LLMs) to create semantically similar duplicates for each of our original 29 labeled pairs. The address pair descriptions, along with the match/mis-match label, will guide the LLM in creating semantically similar labeled address pairs. 

### OpenAI GPT4o for Data Augmentation

We are going to use the OpenAI API for GPT4o to ask the model to generate similar records for each record we show it that match the semantics of the description of each pair. This can multiply the number of records by 10, 100 even 1,000 times.

We set the temporature of the LLM high to ensure we get a diverse set of records. 100 records is as many as we can request at once, so we're going to loop 5 times to get approximately 10,000 labeled address pairs.

### How many Record Clones per Existing Example?

For each hand-labeled record, we execute 1 request for 100 records each - giving us about 5,000 examples for each hand-labeled address pair.

In [12]:
CLONES_PER_RUN = 100
RUNS_PER_EXAMPLE = 2

# Append clones per run and runs per example as columns
gold_df["Clones"] = CLONES_PER_RUN
gold_df["Runs"] = RUNS_PER_EXAMPLE

In [13]:
gold_df.head(len(gold_df))

Unnamed: 0,Description,Address1,Address2,Label,Clones,Runs
0,Different directional prefix formats for same ...,"2024 NW 5th Ave, Miami, FL 33127","2024 Northwest 5th Avenue, Miami, Florida 33127",1,100,2
1,Abbreviated street type for same address shoul...,"10200 NE 12th St, Bellevue, WA 98003","10200 NE 12th Street, Bellevue, WA 98003",1,100,2
2,Common misspellings for same address should match,"1600 Pennsylvna Ave NW, Washington, DC 20500","1600 Pennsylvania Avenue NW, Washington, DC 20500",1,100,2
3,Different directional prefix formats for same ...,"550 S Hill St, Los Angeles, CA","550 South Hill Street, Los Angeles, California",1,100,2
4,Incomplete address vs full address may match,"1020 SW 2nd Ave, Portland","1020 SW 2nd Ave, Portland, OR 97204",1,100,2
...,...,...,...,...,...,...
78,Different address field positions of a pair of...,"75 50th St., Brooklyn, NY 11232","50 75th St., Brooklyn, NY 11209",0,100,2
79,Different address field positions of a pair of...,"74 49th Ave, Long Island City, NY 11232","49 74th Ave, Long Island City, NY 11232",0,100,2
80,Different street names with the same house num...,"555 Oak Court, Seattle WA 98039","555 Maple Lane, Seattle, WA 98039",0,100,2
81,Different street names with the same house num...,101 Post Street,101 Oak Street,0,100,2


## Gold Label Evaluation

Let's define a method to evaluate our models on the original records, rather than on a sample of the augmented records we will generate below. This will let us know in certain terms how different methods of address matching peform.

### Raw Report and Grouped Report

We use a utility function in [utils.py](utils.py) called `gold_label_report` that will apply a list of matching methods to our gold label data and return the raw results and a categorical summary. One DataFrame `raw_df` will contain each address pair, while another `grouped_df` will group them by their `Description` field to better understand each model's performance.

In [14]:
def strict_parse_match(row: pd.Series) -> pd.Series:
    """strict_parse_match Strict address matching"""
    return parse_match_address(row["Address1"], row["Address2"])

In [15]:
# Get raw results and accuracy by type of matching
raw_df, grouped_df = gold_label_report(gold_df, [strict_parse_match])

### Gold Label Report by `Description`

Let's evaluate how well our address matchers work by category.

In [16]:
# Show what it knows, followed by what it don't, in alphabetical order
grouped_df.sort_values(by="strict_parse_match_acc", ascending=False).head(40)

Unnamed: 0_level_0,strict_parse_match_acc
Description,Unnamed: 1_level_1
Including and excluding building names for same address should match,1.0
Typographical errors in city of same address should match,1.0
Same address with and without country should match,1.0
Missing state but has postal code and country for same address should match,1.0
Missing country in one record can match,1.0
Floor bumbers included or excluded for same address should match,1.0
Same address incorporates business name or not should match,1.0
Typographical errors in same address with country should match,1.0
Addresses that match are often missing countries,1.0
Punctuation or not in abbreviations for same address should match,0.0


### Strict Matching Results

You can see that strict matching only works for our gold labeled records under certain circumstances, such as when values not essential for a strict match vary. We will improve upon these results below!

### Gold Label Report

Here we can view each example, including what we got right and what we got wrong. This can lead to iterative improvements.

In [17]:
true_df = raw_df[raw_df["strict_parse_match_correct"]].reset_index(drop=True)
print(f"Total accurate matches for strict_parse_match: {len(true_df):,}")

true_df.head(40)

Total accurate matches for strict_parse_match: 52


Unnamed: 0,Description,Address1,Address2,Label,strict_parse_match,strict_parse_match_correct
0,Missing state but has postal code and country ...,"Pariser Platz 2, 10117 Berlin, Germany","Pariser Platz 2, 10117 Berlin, Berlin, Germany",1,1,True
1,Missing state but has postal code and country ...,"Marienplatz 1, 80331 Munich, Germany","Marienplatz 1, 80331 Munich, Bavaria, Germany",1,1,True
2,Same address with and without country should m...,"1600 Amphitheatre Parkway, Mountain View, CA 9...","1600 Amphitheatre Parkway, Mountain View, CA 9...",1,1,True
3,Same address with and without country should m...,"3413 Sean Way, Lawrenceville, GA 30044, U.S.A.","3413 Sean Way, Lawrenceville, Georgia, 30044",1,1,True
4,Including and excluding building names for sam...,"The Empire State Building, 350 5th Ave, New Yo...","350 5th Ave, New York, NY 10118",1,1,True
5,Floor bumbers included or excluded for same ad...,"350 5th Ave, 86th Floor, New York, NY 10118","350 5th Ave, New York, NY 10118",1,1,True
6,Same address incorporates business name or not...,"Google, 1600 Amphitheatre Parkway, Mountain Vi...","1600 Amphitheatre Parkway, Mountain View, CA 9...",1,1,True
7,Typographical errors in same address with coun...,"Calle Mayor, 10, 28013 Madrid, España","Calle Mayor, 10, 28013 Madird, España",1,1,True
8,Typographical errors in city of same address s...,"16 Rue de la Paix, 75002 Paris, France","16 Rue de la Paix, 75002 Pariss, France",1,1,True
9,Typographical errors in city of same address s...,"Alexanderplatz 1, 10178 Berlin, Deutschland","Alexanderplatz 1, 10178 Berin, Deutschland",1,1,True


In [18]:
false_df = raw_df[raw_df["strict_parse_match_correct"] == False].reset_index(drop=True)
print(f"Total mismatches for strict_parse_match: {len(false_df):,}")

false_df.head(40)

Total mismatches for strict_parse_match: 31


Unnamed: 0,Description,Address1,Address2,Label,strict_parse_match,strict_parse_match_correct
0,Different directional prefix formats for same ...,"2024 NW 5th Ave, Miami, FL 33127","2024 Northwest 5th Avenue, Miami, Florida 33127",1,0,False
1,Abbreviated street type for same address shoul...,"10200 NE 12th St, Bellevue, WA 98003","10200 NE 12th Street, Bellevue, WA 98003",1,0,False
2,Common misspellings for same address should match,"1600 Pennsylvna Ave NW, Washington, DC 20500","1600 Pennsylvania Avenue NW, Washington, DC 20500",1,0,False
3,Different directional prefix formats for same ...,"550 S Hill St, Los Angeles, CA","550 South Hill Street, Los Angeles, California",1,0,False
4,Incomplete address vs full address may match,"1020 SW 2nd Ave, Portland","1020 SW 2nd Ave, Portland, OR 97204",1,0,False
5,Numerical variations for same address should m...,"Third Ave, New York, NY","3rd Avenue, New York, New York",1,0,False
6,Variant format of same address should match,"350 Fifth Avenue, New York, NY 10118","Empire State Bldg, 350 5th Ave, NY, NY 10118",1,0,False
7,Variant format of same address should match,"Çırağan Caddesi No: 32, 34349 Beşiktaş, Istanb...","Ciragan Palace Hotel, Ciragan Street 32, Besik...",1,0,False
8,Different character sets for same address shou...,北京市朝阳区建国路88号,Běijīng Shì Cháoyáng Qū Jiànguó Lù 88 Hào,1,0,False
9,Variant formats of same address should match,上海市黄浦区南京东路318号,上海黄浦南京东路318号,1,0,False


### Setup LLM Caching: Sometimes

Our OpenAI LLM request generates a lot of data when asking for a 100 JSON record array. This causes it to occasionally timeout. Accordingly, we setup in memory caching for its requests, so if there is an exception in the request loop for our training examples below, we can simply re-run the cell and it will rapidly return to where it failed and retry, without having to run the previous requests over again.

**NOTE: If we iterate and make multiple LLM calls per record via `RUNS_PER_EXAMPLE`, we can't use LLM caching because it will give us the cached result for all iterative API calls after the first one.**

In [19]:
# Set LLM caching up front
# set_llm_cache(InMemoryCache())

In [20]:
# High temperature intended to ensure a lot of variety in labeled records
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)

In [21]:
messages: List[Union[SystemMessagePromptTemplate, HumanMessagePromptTemplate]] = [
    SystemMessagePromptTemplate.from_template(
        "I need your help with a data science, data augmentation task. I am fine-tuning "
        "a sentence transformer paraphrase model to match pairs of addresses using text embeddings. "
        "I tried several embedding models and none of them perform well. They need fine-tuning "
        "for this task. I have created 76 example pairs of addresses to serve as gold label training "
        "data for fine-tuning a SentenceTransformer model. Each record has the fields "
        "Address1, Address2, a Description of the semantic the pair of addresses express "
        "(ex. 'different street number means different address') and a Label (1 for positive match, 0 for negative)."
        "\n\n"
        "The tasks cover two categories of corner cases or difficult tasks. The first is when similar "
        "addresses in string distance aren't the same, thus have label 0.\n"
        "The second is the opposite: when dissimilar addresses in string distance are the same, "
        "thus have label 1. The strings you return for Address1 and Address2 should not be literally "
        "the same.\n\n"
        "Your task is to read a pair of Addresses, their Description and their Label and generate {Clones} "
        "different new examples that express a similar semantic. Your job is to create variations "
        "of these records that satisfy the semantic expressed in the description but cover "
        "widely varying cases of the meaning covering addresses in countries all over the world. "
        "Do not literally copy the address components. Think methodically, logically and create variations. "
        "Use what you know about postal addresses to accomplish this work."
        "\n\n"
        "You should return the result in a valid JSON array of records and nothing else, using the "
        "fields Address1, Address2, Description and Label."
    ),
    # MessagesPlaceholder(variable_name="chat_history"),
    HumanMessagePromptTemplate.from_template(
        "Please generate {Clones} different examples in a JSON list that express the same or similar semantic as "
        "the pair of addresses below based on its Descripton, Label and the Address pairs.\n\n"
        "Address 1: {Address1}\n"
        + "Address 2: {Address2}\n"
        + "Description: {Description}\n"
        + "Label: {Label}\n"
    ),
]
prompt = ChatPromptTemplate.from_messages(messages=messages)

# Everything look alright?
print(
    prompt.format(
        Clones=CLONES_PER_RUN,
        Address1=gold_df.iloc[0]["Address1"],
        Address2=gold_df.iloc[0]["Address2"],
        Description=gold_df.iloc[0]["Description"],
        Label=gold_df.iloc[0]["Label"],
    )
)

System: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses using text embeddings. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created 76 example pairs of addresses to serve as gold label training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic the pair of addresses express (ex. 'different street number means different address') and a Label (1 for positive match, 0 for negative).

The tasks cover two categories of corner cases or difficult tasks. The first is when similar addresses in string distance aren't the same, thus have label 0.
The second is the opposite: when dissimilar addresses in string distance are the same, thus have label 1. The strings you return for Address1 and Address2 should not be literally the same.

Your task is to read a pair of

In [22]:
json_output_parser = JsonOutputParser()

label_chain = LLMChain(
    name="label_chain", prompt=prompt, llm=llm, output_parser=json_output_parser, verbose=True
)

# Test it once...
TEST_INDEX = 0

result = label_chain.run(**gold_df.iloc[TEST_INDEX].to_dict())
print(result)



[1m> Entering new label_chain chain...[0m
Prompt after formatting:
[32;1m[1;3mSystem: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses using text embeddings. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created 76 example pairs of addresses to serve as gold label training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic the pair of addresses express (ex. 'different street number means different address') and a Label (1 for positive match, 0 for negative).

The tasks cover two categories of corner cases or difficult tasks. The first is when similar addresses in string distance aren't the same, thus have label 0.
The second is the opposite: when dissimilar addresses in string distance are the same, thus have label 1. The strings you return for Add

### Run Examples through an LLM to Generate Records to Fine-Tune `SentenceTransformers` and other Models

Now that we know our chain works, let's generate some training data! I have created a helper function called `augment_gold_labels` that iterates through our gold labeled data and submits them for augmentation as we have seen above. You can find it in [utils.py](utils.py).

In [None]:
augment_results_df = augment_gold_labels(gold_df, runs_per_example=RUNS_PER_EXAMPLE)

Starting 166 API calls, 2x for each of 83 hand-labeled records.


OpenAI API Calls:   0%|          | 0/166 [00:00<?, ?it/s]



[1m> Entering new label_chain chain...[0m
Prompt after formatting:
[32;1m[1;3mSystem: I need your help with a data science, data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them perform well. They need fine-tuning for this task. I have created about 100 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic they express (ex. 'different street number') and a Label (1 for positive match, 0 for negative).

The tasks cover two categories of corner cases or difficult tasks. The first is when similar addresses in string distance aren't the same, thus have label 0. The second is the opposite: when dissimilar addresses in string distance are the same, thus have label 1. The strings you return for Address1 and Address2 should not be literally the same.

Your task is 

In [None]:
# Save for the other notebook Address Matching Deep Dive
augment_results_df.to_parquet("data/training.8.parquet")

In [None]:
augment_results_df.head()

In [None]:
# Shuffle our results so we can see different examples - remember we shuffle again in our train_test_split
augment_results_df = augment_results_df.sample(frac=1.0).reset_index(drop=True)
augment_results_df.head(50)

### Sanity Check our Descriptions

At one point the model generated 50K examples of one type because my iteration on the LLM instructions went haywire. Let's make sure we have a variety of corner cases in our data. Depending on the prompt, it is possible for the jobs we submitted to return surprising data :)

In [None]:
augment_results_df.groupby("Description").count()["Label"].head(40)

## Complete!

Data augmentation is now complete, you can find our new dataset in [`data/training.7.parquet`](data/training.7.parquet). Now return to the [Address Matching Deep Dive.ipynb](Address%20Matching%20Deep%20Dive.ipynb) notebook.