# Merging Tables Based on "Fuzzy" Matches

## Step 1: Trying to Find a Match

In [5]:
import pandas as pd

In [6]:
agent_file = '/Users/haru/Documents/data/sofla/licensing/sofla_agent-lists_elliman.csv'
license_file = '/Users/haru/Documents/data/sofla/licensing/sofla_licensing_dade_just-addresses_2018-04-28.csv'

In [7]:
agentFrame = pd.read_csv(agent_file, dtype=str)
licenseFrame = pd.read_csv(license_file, dtype=str)

Our goal here is to take the place-of-work addresses contained in the list of agents we downloaded from the Douglas Elliman website and merge them with the residence addresses for those same agents in the licensing file.

But because agents often do business under a slightly different name from what is on their real estate license, it helps to use a 'fuzzy' matching strategy.

To do that, we want to break the names on the agent list apart and use each piece to filter down the licensing list.

We've already split the agent file into first and last names. Let's see what comes up for Joshua Ackerman, an agent listed on Elliman's website.

First, we'll search for Ackerman (in upper case).

In [11]:
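# True for every row whose licensed name contains 'ACKERMAN'; missing names count as False
mask = licenseFrame['Licensee Name'].str.contains('ACKERMAN', na=False)
# Keep only the rows where the mask is True
group = licenseFrame.loc[mask]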
print(group['Licensee Name'].values)

['ACKERMAN, JOE SAM' 'ACKERMAN, ARMINDA' 'ACKERMAN, MATTHEW D'
 'ACKERMAN, JOSHUA ELAN']


There is only one Joshua Ackerman in the licensing roll, but it's not a perfect match. The license includes a middle name, 'Elan.'

Now that we've filtered down by last name, we can make a second pass based on the first name: a filter within a filter.

In [13]:
mask = group['Licensee Name'].str.contains('JOSHUA', na=False)
group = group.loc[mask]
print(group['Licensee Name'].values)

['ACKERMAN, JOSHUA ELAN']
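The two passes above can also be collapsed into a single step by combining the masks with `&`. The snippet below is a minimal sketch, not the notebook's own code: `toy_licenses` is a hypothetical stand-in for `licenseFrame`, built only from the four names printed earlier, and `regex=False` is an optional choice that treats the search text as a literal string.

```python
import pandas as pd

# Hypothetical stand-in for licenseFrame, using the four names printed above
toy_licenses = pd.DataFrame({
    'Licensee Name': [
        'ACKERMAN, JOE SAM',
        'ACKERMAN, ARMINDA',
        'ACKERMAN, MATTHEW D',
        'ACKERMAN, JOSHUA ELAN',
    ]
})

names = toy_licenses['Licensee Name']

# Each mask is a boolean Series with one True/False value per row
last_mask = names.str.contains('ACKERMAN', na=False, regex=False)
first_mask = names.str.contains('JOSHUA', na=False, regex=False)

# `&` keeps only the rows where both masks are True
combined = toy_licenses.loc[last_mask & first_mask]
print(combined['Licensee Name'].values)
```

Filtering in two passes, as the cells above do, makes it easier to inspect the intermediate results; the combined mask is just more compact.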


That's a good match. But since it's possible for there to be two real estate agents in the same county under the same name, it would help to have supporting evidence that these are the same person.

A license number would be ideal, but Douglas Elliman doesn't provide that on their site.

Instead, we can look at the licensing roll to see which firm is listed as Joshua Ackerman's employer. While it's possible that there are two Joshua Ackermans in Miami real estate, the chances that they both would work at Douglas Elliman are pretty low.

In [14]:
print(group["Employer's Name"].values)

['DOUGLAS ELLIMAN FLORIDA LLC']
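The employer check can itself be written as a third mask rather than an eyeball comparison. The sketch below uses hypothetical data: the second row, including the employer `'PLACEHOLDER REALTY INC'`, is invented purely to show what happens when two licensees share a name, and searching for the substring `'ELLIMAN'` is an assumption about how the firm appears in the roll.

```python
import pandas as pd

# Hypothetical rows: the second licensee and its employer are made up for illustration
toy_group = pd.DataFrame({
    'Licensee Name': ['ACKERMAN, JOSHUA ELAN', 'ACKERMAN, JOSHUA ELAN'],
    "Employer's Name": ['DOUGLAS ELLIMAN FLORIDA LLC', 'PLACEHOLDER REALTY INC'],
})

# Keep only the licensees whose listed employer mentions the firm
employer_mask = toy_group["Employer's Name"].str.contains(
    'ELLIMAN', na=False, regex=False
)
confirmed = toy_group.loc[employer_mask]

print(confirmed["Employer's Name"].values)
```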


That's good enough for us!

But now that we've found a match, we still need to figure out a way to merge the information from the two tables. To do that, we need to deal with the difference between an agent's licensed name and their public one.

We could clean the licensing data, but that would take a lot of time for very little benefit. The purpose of cleaning the licensed names would be to ensure a good match between the licensing file and the agent file—but we've already made the match using boolean masks. Cleaning the names is, therefore, unnecessary.

We can merge the data, then, with a simple call to pd.concat().
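As a rough sketch of what that call might look like, the snippet below joins one agent row and one license row side by side. Apart from `'Licensee Name'`, the column names, index values, and address placeholders are assumptions made for illustration; the real files may be laid out differently.

```python
import pandas as pd

# Hypothetical one-row frames; only 'Licensee Name' is a column known from the licensing file
agent_row = pd.DataFrame(
    {
        'First Name': ['Joshua'],
        'Last Name': ['Ackerman'],
        'Office Address': ['(office address)'],
    },
    index=[17],
)
license_row = pd.DataFrame(
    {
        'Licensee Name': ['ACKERMAN, JOSHUA ELAN'],
        'Residence Address': ['(residence address)'],
    },
    index=[4052],
)

# pd.concat(axis=1) lines rows up by index label, so both frames are
# reset to index 0 first; otherwise the result would have two half-empty rows
merged = pd.concat(
    [agent_row.reset_index(drop=True), license_row.reset_index(drop=True)],
    axis=1,
)
print(merged.columns.tolist())
```

Resetting the index matters here because the matched rows come from different positions in their source tables.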

## Step 2: Merging the Data

We can define a function that will complete the following tasks:
    1. Use boolean masks (i.e. "filters") to identify a match between an agent's public name and their licensed name.
    2. 