# 4. Record linkage
**Record linkage is a powerful technique for merging multiple datasets when values contain typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.**

## Comparing strings
Welcome to the final chapter of this course, where we'll discover the world of record linkage. But before we dive deep into record linkage, let's sharpen our understanding of string similarity and minimum edit distance.

### Minimum edit distance
Minimum edit distance is a systematic way to identify how close 2 strings are.

For example, let's take a look at the following two words: *intention*, and *execution*.

The minimum edit distance between them is the smallest number of steps needed to get from *intention* to *execution*, where the available operations are inserting new characters, deleting them, substituting them, and transposing consecutive characters.

To get from *intention* to *execution*, we start by deleting `I` from *intention* and adding `C` between `E` and `N`. Our minimum edit distance so far is **2**, since these are two operations. Then we substitute the first `N` with `E`, `T` with `X`, and `N` with `U`, which gives us *execution*, for a minimum edit distance of **5**.

The lower the edit distance, the closer two words are. For example, two different typos of *reading*, '*reeding*' and '*redaing*', each have a minimum edit distance of **1** from *reading*: one substitution for the first, one transposition for the second.
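The operation counting above can be checked with a short dynamic-programming routine. The sketch below is a minimal pure-Python illustration, not the implementation used by any particular package; the function name `edit_distance` and its `transpositions` flag are made up for this example. With `transpositions=False` it computes the Levenshtein distance, and with `transpositions=True` it also allows swapping two adjacent characters (the restricted variant of Damerau-Levenshtein).

```python
def edit_distance(a, b, transpositions=False):
    """Minimum number of single-character edits turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = distance between the first i chars of a and the first j chars of b
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Optionally allow swapping two adjacent characters
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance('intention', 'execution'))
print(edit_distance('reeding', 'reading'))
print(edit_distance('redaing', 'reading'))
print(edit_distance('redaing', 'reading', transpositions=True))
```

Without transpositions, '*redaing*' needs two substitutions to become *reading*; allowing a transposition brings that down to one step.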

### Minimum edit distance algorithm
There is a variety of algorithms based on edit distance. They differ in which operations they use, how much weight they give each operation, which types of strings they're suited for, and more, and a variety of packages implement them.

For this lesson, we'll compare strings using Levenshtein distance, since it's the most general form of string matching, and we'll compute it with the `fuzzywuzzy` package.

Algorithm | Operations
:---|:---
Damerau-Levenshtein | insertion, substitution, deletion, transposition
***Levenshtein*** | ***insertion, substitution, deletion***
Hamming | substitution only
Jaro distance | transposition only
... | ...

**possible packages**: `nltk`, `fuzzywuzzy`, `textdistance`, ...
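To make the difference between operation sets concrete, here is a minimal sketch of Hamming distance, which only counts substitutions and is therefore only defined for strings of equal length. The helper name `hamming` is illustrative and not taken from any of the packages listed.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming('reeding', 'reading'))  # one substituted character
print(hamming('sign', 'sing'))        # a swap counts as two substitutions here
```

Because transposition is not available, swapping two neighbouring letters costs two steps under Hamming distance, whereas an algorithm that supports transposition counts it as one.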

### Simple string comparison
***Fuzzywuzzy*** is a simple-to-use package for string comparison. We first import `fuzz` from `fuzzywuzzy`, which allows us to compare single strings.

Here we use `fuzz`'s `WRatio()` function to compute the similarity between *reading* and its typo, passing each string as an argument. For any fuzzywuzzy comparison function, the output is a score from 0 to 100, with 0 being not similar at all and 100 being an exact match.

```python
# Compare between two strings
from fuzzywuzzy import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')
```

```
86
```

Do not confuse this with the minimum edit distance from earlier, where a lower value means a closer match.
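The direction of the similarity scale can be illustrated with Python's standard-library `difflib`. This is a rough stand-in rather than fuzzywuzzy's `WRatio()`, which combines several strategies and may give different numbers; the helper name `similarity` is made up for this sketch.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale: 100 is an exact match, 0 shares nothing."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(similarity('Reading', 'Reading'))  # identical strings
print(similarity('Reeding', 'Reading'))  # one wrong letter
print(similarity('Reeding', 'xyz'))      # no characters in common
```

Here a higher number means a closer match, which is the opposite direction from an edit distance.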

### Partial strings and different orderings
The `WRatio()` function is highly robust against partial string comparison with different orderings.

For example, here we compare the strings `'Houston Rockets'` and `'Rockets'`, and still receive a high similarity score.
```python
# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')
```
```
90
```

The same can be said for the strings `'Houston Rockets vs Los Angeles Lakers'` and `'Lakers vs Rockets'`, where the team names are only partial and are ordered differently.
```python
# Partial string comparison
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')
```
```
86
```
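Part of this robustness comes from comparing words rather than raw character sequences. As a rough illustration (not fuzzywuzzy's actual implementation), the sketch below sorts the words in each string before comparing them with the standard-library `difflib`, so that word order stops mattering. The helper names `plain_score` and `token_sort_score` are made up for this example.

```python
from difflib import SequenceMatcher


def plain_score(a, b):
    """Character-level similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_score(a, b):
    """Similarity after lowercasing and sorting the words of each string."""
    sorted_a = ' '.join(sorted(a.lower().split()))
    sorted_b = ' '.join(sorted(b.lower().split()))
    return plain_score(sorted_a, sorted_b)


a, b = 'Rockets vs Lakers', 'Lakers vs Rockets'
print(plain_score(a, b))       # penalised by the different word order
print(token_sort_score(a, b))  # identical once the words are sorted
```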

### Comparison with arrays
We can also compare a string with an array of strings by using the `extract()` function from fuzzywuzzy's `process` module.
```python
# Import pandas and process
import pandas as pd
from fuzzywuzzy import process

# Define string and array of possible matches
string = 'Houston Rockets vs Los Angeles Lakers'
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                    'Houston vs Los Angeles', 'Heat vs Bulls'])

process.extract(string, choices, limit = 2)
```
```
[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
```
`extract()` takes a string, an array of strings, and a `limit`: the number of matches to return, ranked from highest to lowest score. It returns a list of tuples with three elements: the matching string, its similarity score, and its index in the array. (The index is included here because `choices` is a pandas Series; with a plain list or array, each tuple holds only the match and its score.)
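The shape of this return value can be mimicked with the standard library. The sketch below is an approximation under stated assumptions: `top_matches` is a hypothetical helper, and it scores with `difflib` rather than `WRatio()`, so its scores and ranking will generally differ from what `process.extract()` reports.

```python
from difflib import SequenceMatcher


def top_matches(query, choices, limit=2):
    """Return up to `limit` (choice, score, index) tuples, best score first."""
    scored = [
        (choice, round(100 * SequenceMatcher(None, query, choice).ratio()), idx)
        for idx, choice in enumerate(choices)
    ]
    # sorted() is stable, so tied scores keep their original order
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


choices = ['Rockets vs Lakers', 'Lakers vs Rockets',
           'Houston vs Los Angeles', 'Heat vs Bulls']
for match, score, index in top_matches('Houston Rockets vs Los Angeles Lakers', choices):
    print(index, score, match)
```

Because this stand-in compares raw characters only, it has none of the partial-match or word-order handling described earlier, which is one reason to prefer `WRatio()` for this kind of data.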

### Collapsing categories with string similarity
**Chapter 2**

Use `.replace()` to collapse `'eur'` into `'Europe'`

In chapter 2, we learned that collapsing data into categories is an essential aspect of working with categorical and text data, and we saw how to manually replace categories in a column of a DataFrame. 

*What if there are too many variations?*

`'EU'`, `'eur'`, `'Europe'`, `'Europa'`, `'Erope'`, `'Evropa'` ...

But what if we had so many inconsistent categories that manual replacement is simply not feasible? We can handle that with string similarity!

Say we have a DataFrame named `survey` containing answers from respondents in the states of New York and California, who were asked how likely they are to move on a scale of 0 to 5.

```python
print(survey['state'].unique())
```
```
id           state
0       California
1             Cali
2       Calefornia
3       Calefornie
4       Californie
5        Calfornia
6       Calefernia
7         New York
8    New York City
```

The state field was free text and contains hundreds of typos. Remapping them manually would take a huge amount of time. Instead, we'll use string similarity. 

```python
categories
```
```
   state
0  California
1  New York
```

We also have a `categories` DataFrame containing the correct spelling of each state.

Let's collapse the incorrect categories with string matching!

### Collapsing all of the states

```python
# For each correct category
for state in categories['state']:
    # Find potential matches in the states with typos
    matches = process.extract(state, survey['state'],
                              limit=survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If the similarity score is high enough
        if potential_match[1] >= 80:
            # Replace the typo with the correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state
```

We first create a for loop iterating over each correctly typed state in the categories DataFrame. 

For each state, we find its matches in the state column of the survey DataFrame, returning all possible matches by setting the limit argument of extract to the length of the survey DataFrame. 

Then we iterate over each potential match, keeping only the ones with a similarity score greater than or equal to 80 with an `if` statement.

Then for each of those returned strings, we replace it with the correct state using the loc method.
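The same loop can be tried without pandas on a plain list. This is a minimal sketch under stated assumptions: the sample values and the `collapse` helper are hypothetical, and scoring uses the standard-library `difflib` instead of fuzzywuzzy, so individual scores will differ from fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def collapse(values, categories, cutoff=80):
    """Replace each value with the first category it matches at or above cutoff."""
    cleaned = []
    for value in values:
        replacement = value  # keep the original if nothing is close enough
        for category in categories:
            score = 100 * SequenceMatcher(None, value.lower(), category.lower()).ratio()
            if score >= cutoff:
                replacement = category
                break
        cleaned.append(replacement)
    return cleaned


states = ['California', 'Calefornia', 'Californie', 'New York', 'New Yrok']
print(collapse(states, ['California', 'New York']))
```

Values that are not close enough to any category are left untouched, which is the same effect the `if` statement has in the pandas version.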

### Record linkage
Record linkage attempts to join data sources that have similarly fuzzy duplicate values, using string similarity so that we end up with a final DataFrame with no duplicates.

### Minimum edit distance
Minimum edit distance is the minimum number of steps needed to get from String A to String B, with the operations available being:

- **Insertion** of a new character.
- **Deletion** of an existing character.
- **Substitution** of an existing character.
- **Transposition** of two existing consecutive characters.

*What is the minimum edit distance from `'sign'` to `'sing'`, and which operation(s) gets you there?*

1. ~2 by substituting `'g'` with `'n'` and `'n'` with `'g'`.~
2. **1 by transposing `'g'` with `'n'`.**
3. ~1 by substituting `'g'` with `'n'`.~
4. ~2 by deleting `'g'` and inserting a new `'g'` at the end.~

**Answer: 2.** Transposing the last two letters of `'sign'` is the easiest way to get to `'sing'`.




## The cutoff point
The ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.

This version of the `restaurants` DataFrame has been collected from many sources, where the `cuisine_type` column is riddled with typos, and should contain only `italian`, `american` and `asian` cuisine types. There are so many unique categories that remapping them manually isn't scalable, and it's best to use string similarity instead.

Before doing so, you want to establish the cutoff point for the similarity score using `fuzzywuzzy`'s `process.extract()` function, by finding the similarity score of the most *distant* typo of each category.

```python
import pandas as pd

restaurants = pd.read_csv('restaurants.csv')
restaurants.head()
```
```
Unnamed: 0.1,Unnamed: 0,name,addr,city,phone,type
0,0,kokomo,6333 w. third st.,la,2139330773,american
1,1,feenix,8358 sunset blvd. west,hollywood,2138486677,american
2,2,parkway,510 s. arroyo pkwy .,pasadena,8187951001,californian
3,3,r-23,923 e. third st.,los angeles,2136877178,japanese
4,4,gumbo,6333 w. third st.,la,2139330358,cajun/creole
```


- Import `process` from `fuzzywuzzy`.
- Store the unique `cuisine_type` values in `unique_types`.
- Calculate the similarity of `'asian'`, `'american'`, and `'italian'` to all possible `cuisine_type` values using `process.extract()`, while returning all possible matches.

```python
# Import process from fuzzywuzzy
from fuzzywuzzy import process

# Store the unique cuisine values in unique_types
# (the CSV loaded above names this column 'type')
unique_types = restaurants['type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit=len(unique_types)), '\n')

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit=len(unique_types)), '\n')

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit=len(unique_types)))
```
```
[('asian', 100), ('indonesian', 72), ('italian', 67), ('russian', 67), ('american', 62), ('californian', 54), ('japanese', 54), ('mexican/tex-mex', 54), ('american ( new )', 54), ('mexican', 50), ('cajun/creole', 36), ('middle eastern', 36), ('vietnamese', 36), ('pacific new wave', 36), ('fast food', 36), ('chicken', 33), ('hamburgers', 27), ('hot dogs', 26), ('coffeebar', 26), ('continental', 26), ('steakhouses', 25), ('southern/soul', 22), ('delis', 20), ('eclectic', 20), ('pizza', 20), ('health food', 19), ('diners', 18), ('coffee shops', 18), ('noodle shops', 18), ('french ( new )', 18), ('desserts', 18), ('seafood', 17), ('chinese', 17)] 

[('american', 100), ('american ( new )', 90), ('mexican', 80), ('mexican/tex-mex', 68), ('asian', 62), ('italian', 53), ('russian', 53), ('middle eastern', 51), ('pacific new wave', 45), ('hamburgers', 44), ('indonesian', 44), ('chicken', 40), ('southern/soul', 39), ('japanese', 38), ('eclectic', 38), ('delis', 36), ('pizza', 36), ('cajun/creole
```

The same steps, using the `cuisine_type` column named in the exercise:

```python
# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit=len(unique_types)), '\n')

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit=len(unique_types)), '\n')

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit=len(unique_types)))
```

output:
```
[('asian', 100), ('asiane', 91), ('asiann', 91), ('asiian', 91), ('asiaan', 91), ('asianne', 83), ('asiat', 80), ('italiann', 72), ('italiano', 72), ('italianne', 72), ('italian', 67), ('amurican', 62), ('american', 62), ('italiaan', 62), ('italiian', 62), ('itallian', 62), ('americann', 57), ('americano', 57), ('ameerican', 57), ('aamerican', 57), ('ameriican', 57), ('amerrican', 57), ('ammericann', 54), ('ameerrican', 54), ('ammereican', 54), ('america', 50), ('merican', 50), ('murican', 50), ('italien', 50), ('americen', 46), ('americin', 46), ('amerycan', 46), ('itali', 40)]

[('american', 100), ('americann', 94), ('americano', 94), ('ameerican', 94), ('aamerican', 94), ('ameriican', 94), ('amerrican', 94), ('america', 93), ('merican', 93), ('ammericann', 89), ('ameerrican', 89), ('ammereican', 89), ('amurican', 88), ('americen', 88), ('americin', 88), ('amerycan', 88), ('murican', 80), ('asian', 62), ('asiane', 57), ('asiann', 57), ('asiian', 57), ('asiaan', 57), ('italian', 53), ('asianne', 53), ('italiann', 50), ('italiano', 50), ('italiaan', 50), ('italiian', 50), ('itallian', 50), ('italianne', 47), ('asiat', 46), ('itali', 40), ('italien', 40)]

[('italian', 100), ('italiann', 93), ('italiano', 93), ('italiaan', 93), ('italiian', 93), ('itallian', 93), ('italianne', 88), ('italien', 86), ('itali', 83), ('asian', 67), ('asiane', 62), ('asiann', 62), ('asiian', 62), ('asiaan', 62), ('asianne', 57), ('amurican', 53), ('american', 53), ('americann', 50), ('asiat', 50), ('americano', 50), ('ameerican', 50), ('aamerican', 50), ('ameriican', 50), ('amerrican', 50), ('ammericann', 47), ('ameerrican', 47), ('ammereican', 47), ('america', 43), ('merican', 43), ('murican', 43), ('americen', 40), ('americin', 40), ('amerycan', 40)]
```    

### Question
Take a look at the output. What do you think the similarity cutoff point should be when remapping categories?

1. **80**
2. ~70~ - (*Not quite, `'asian'` and `'italiann'` have a similarity score of ***72***, meaning `'italiann'` would be converted to `'asian'`.*)
3. ~60~ - (*That's too low, and you risk converting a lot more categories than you should.*)

**Answer: 1.** **80** is the sweet spot where you convert all the typos without remapping values into the wrong category.
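The reasoning behind the cutoff can be made explicit. The sketch below copies a handful of the `'asian'` scores from the output above and, assuming for illustration that every value starting with `'asia'` is a genuine typo of `'asian'`, finds the lowest score among the typos and the highest score among everything else. Any cutoff between those two numbers separates the groups. The variable names are illustrative.

```python
# A subset of the (match, score) pairs printed above for 'asian'
asian_scores = [('asian', 100), ('asiane', 91), ('asianne', 83), ('asiat', 80),
                ('italiann', 72), ('italian', 67), ('american', 62), ('itali', 40)]

# Assumption for this sketch: values starting with 'asia' are typos of 'asian'
typo_scores = [score for match, score in asian_scores if match.startswith('asia')]
other_scores = [score for match, score in asian_scores if not match.startswith('asia')]

lowest_typo = min(typo_scores)     # the most distant genuine typo
highest_other = max(other_scores)  # the closest value from another category

print(lowest_typo, highest_other)
# A usable cutoff must be above highest_other and no higher than lowest_typo
print(highest_other < 80 <= lowest_typo)
```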

## Remapping categories II
In the last exercise, you determined that the similarity cutoff point for remapping typos of `'american'`, `'asian'`, and `'italian'` cuisine types stored in the `cuisine_type` column should be **80**.

In this exercise, you're going to put it all together: for each correct cuisine type, find matches with similarity scores of 80 or higher using `fuzzywuzzy.process`'s `extract()` function, and replace those matches with the correct type. Remember, when comparing a string with an array of strings using `process.extract()`, the output is a list of tuples where each is formatted like:
```
(closest match, similarity score, index of match)
```

- Return all of the unique values in the `cuisine_type` column of `restaurants`.

```python
# Inspect the unique values of the cuisine column ('type' in the CSV loaded above)
print(restaurants['type'].unique())
```
```
['american' 'californian' 'japanese' 'cajun/creole' 'hot dogs' 'diners'
 'delis' 'hamburgers' 'seafood' 'italian' 'coffee shops' 'russian'
 'steakhouses' 'mexican/tex-mex' 'noodle shops' 'mexican' 'middle eastern'
 'asian' 'vietnamese' 'health food' 'american ( new )' 'pacific new wave'
 'indonesian' 'eclectic' 'chicken' 'fast food' 'southern/soul' 'coffeebar'
 'continental' 'french ( new )' 'desserts' 'chinese' 'pizza']
```


```python
# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())
```

```
['america' 'merican' 'amurican' 'americen' 'americann' 'asiane' 'itali' 'asiann' 'murican' 'italien' 'italian' 'asiat' 'american' 'americano' 'italiann' 'ameerican' 'asianne' 'italiano' 'americin' 'ammericann' 'amerycan' 'aamerican' 'ameriican' 'italiaan' 'asiian' 'asiaan' 'amerrican' 'ameerrican' 'ammereican' 'asian' 'italianne' 'italiian' 'itallian']
```

Looks like you will need to use some string matching to correct these misspellings.

- As a first step, create a list of all possible `matches`, comparing `'italian'` with the restaurant types listed in the `cuisine_type` column.

```python
# Create a list of matches, comparing 'italian' with the cuisine column ('type' in this CSV)
matches = process.extract('italian', restaurants['type'], limit=len(restaurants.type))

# Inspect the first 5 matches
print(matches[0:5])
```
```
[('italian', 100, 14), ('italian', 100, 21), ('italian', 100, 47), ('italian', 100, 57), ('italian', 100, 73)]
```


```python
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Inspect the first 5 matches
print(matches[0:5])
```

```
[('italian', 100, 11), ('italian', 100, 25), ('italian', 100, 41), ('italian', 100, 47), ('italian', 100, 49)]
```

Now you're getting somewhere! Now you can iterate through `matches` to reassign similar entries.

- Within the `for` loop, use an `if` statement to check whether the similarity score in each `match` is greater than or equal to 80.
- If it is, use `.loc` to select rows where `cuisine_type` in `restaurants` is equal to the current match (which is the first element of `match`), and reassign them to be `'italian'`.

```python
# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine is spelled this way, and set only
        # the cuisine column ('type' in this CSV) to the correct value
        restaurants.loc[restaurants['type'] == match[0], 'type'] = 'italian'
```

Finally, you'll adapt your code to work with every restaurant type in `categories`.

- Using the variable `cuisine` to iterate through `categories`, embed your code from the previous step in an outer `for` loop.
- Inspect the final result.

```python
categories = ['italian', 'asian', 'american']

# Iterate through categories
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine column ('type' in this CSV)
    matches = process.extract(cuisine, restaurants['type'], limit=len(restaurants.type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine is spelled this way,
            # and set only the cuisine column to the correct value
            restaurants.loc[restaurants['type'] == match[0], 'type'] = cuisine

# Inspect the final result
print(restaurants['type'].unique())
```
```
['american' 'californian' 'japanese' 'cajun/creole' 'hot dogs' 'diners'
 'delis' 'hamburgers' 'seafood' 'italian' 'coffee shops' 'russian'
 'steakhouses' 'mexican/tex-mex' 'noodle shops' 'middle eastern' 'asian'
 'vietnamese' 'health food' 'pacific new wave' 'indonesian' 'eclectic'
 'chicken' 'fast food' 'southern/soul' 'coffeebar' 'continental'
 'french ( new )' 'desserts' 'chinese' 'pizza']
```


```python
# Iterate through categories
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine_type is spelled this way,
            # and set only the cuisine_type column to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(restaurants['cuisine_type'].unique())
```

output:
```
['american' 'asian' 'italian']
```

*All your cuisine types are properly mapped. Now you'll build on string similarity, by jumping into record linkage.*

---

## Generating pairs
### Record linkage
Record linkage is the act of linking data from different sources regarding the same entity. Generally, we clean two or more DataFrames, generate pairs of potentially matching records, score these pairs according to string similarity and other similarity metrics, and link them. All of these steps can be achieved with the `recordlinkage` package.

### DataFrames
There are two DataFrames, census_A, and census_B, containing data on individuals throughout the states. 
```python
census_A
```
```
              given_name  surname  date_of_birth         suburb  state  address_1
rec_id
rec-1070-org    michaela  neumann       19151111  winston hills    cal  stanley street
rec-1016-org    courtney  painter       19161214      richlands    txs  pinkerton circuit
```
```python
census_B
```
```
                given_name  surname  date_of_birth      suburb  state  address_1
rec_id
rec-561-dup-0        elton      NaN       19651013  windermere     ny  light setreet
rec-2642-dup-0    mitchell    maxon       19390212  north ryde    cal  edkins street
```

We want to merge them while avoiding duplication. Since they were collected manually and are prone to typos, there are no consistent IDs between them, so we use record linkage.

### Generating pairs
We first want to generate pairs between both DataFrames. Ideally, we want to generate all possible pairs between our DataFrames, but what if we had big DataFrames and ended up having to generate millions if not billions of pairs? It wouldn't prove scalable and could seriously hamper development time.

### Blocking
This is where we apply what we call blocking, which creates pairs based on a matching column, in this case the `state` column, reducing the number of possible pairs.
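The effect of blocking on the number of candidate pairs can be seen on a toy example. The records below are hypothetical and stored as plain tuples; this is only a sketch of the idea, not how `recordlinkage` builds its index internally.

```python
from itertools import product

# Hypothetical records: (record id, state)
census_a = [('a1', 'cal'), ('a2', 'txs'), ('a3', 'ny'), ('a4', 'cal')]
census_b = [('b1', 'ny'), ('b2', 'cal'), ('b3', 'txs')]

# Full index: every record in A paired with every record in B
all_pairs = [(id_a, id_b) for (id_a, _), (id_b, _) in product(census_a, census_b)]

# Blocked index: keep only pairs whose state values are identical
blocked_pairs = [(id_a, id_b)
                 for (id_a, state_a), (id_b, state_b) in product(census_a, census_b)
                 if state_a == state_b]

print(len(all_pairs), len(blocked_pairs))
print(blocked_pairs)
```

The trade-off is that two records describing the same entity will never be paired if the blocking column itself contains a typo in one of them, so the blocking column should be cleaned first.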

### Generating pairs
To do this, we start by importing `recordlinkage`. We then use the `recordlinkage.Index()` function to create an indexing object, which we can use to generate pairs from our DataFrames. To generate pairs blocked on state, we use the `.block()` method, passing in the `state` column name.
```python
# Import recordlinkage
import recordlinkage

# Create indexing object
indexer = recordlinkage.Index()

# Generate pairs blocked on state
indexer.block('state')
pairs = indexer.index(census_A, census_B)
```

Once the indexer object has been initialized, we generate our pairs using the `.index()` method, which takes in the two DataFrames.

```python
print(pairs)
```
```
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 'rec-1280-org', ......
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12, 95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
```

The resulting object is a pandas MultiIndex containing pairs of row indices from both DataFrames, which is a fancy way of saying it is an array of possible index pairs that makes it much easier to subset DataFrames.

### Comparing the DataFrames
Since we've already generated our pairs, it's time to find potential matches. 

```python
# Generate the pairs
pairs = indexer.index(census_A, census_B)
# Create a Compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')
# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)
```

We start by creating a comparison object using the `recordlinkage.Compare()` function. This is similar to the indexing object we created while generating pairs, but this one is responsible for assigning different comparison procedures to pairs. Let's say there are columns for which we want exact matches between the pairs.

To do that, we use the `.exact()` method. It takes in the column name in question for each DataFrame, which in this case is `date_of_birth` and `state`, and a `label` argument that lets us set the column name in the resulting DataFrame.

To compute string similarities between pairs of rows for columns that have fuzzy values, we use the `.string()` method, which also takes in the column names in question, along with the similarity cutoff point in the `threshold` argument, a value between 0 and 1 that we set here to 0.85.

Finally to compute the matches, we use the compute function, which takes in the possible pairs, and the two DataFrames in question. 

Note that you need to always have the same order of DataFrames when inserting them as arguments when generating pairs, comparing between columns, and computing comparisons.
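What the comparison step produces for a single pair can be sketched in plain Python. This is an illustration under assumptions, not `recordlinkage`'s implementation: the two records are hypothetical, `compare_pair` is a made-up helper, and string similarity comes from the standard-library `difflib`, whose scores need not agree with the string measure `recordlinkage` uses.

```python
from difflib import SequenceMatcher


def compare_pair(rec_a, rec_b, exact_cols, string_cols, threshold=0.85):
    """Return a dict of 1/0 flags, one per compared column."""
    result = {}
    for col in exact_cols:
        result[col] = int(rec_a[col] == rec_b[col])
    for col in string_cols:
        ratio = SequenceMatcher(None, rec_a[col], rec_b[col]).ratio()
        result[col] = int(ratio >= threshold)
    return result


# Hypothetical pair of records describing the same person
rec_a = {'date_of_birth': 19151111, 'state': 'cal',
         'surname': 'neumann', 'address_1': 'stanley street'}
rec_b = {'date_of_birth': 19151111, 'state': 'cal',
         'surname': 'neuman', 'address_1': 'stanly st'}

flags = compare_pair(rec_a, rec_b,
                     exact_cols=['date_of_birth', 'state'],
                     string_cols=['surname', 'address_1'])
print(flags)
print(sum(flags.values()))
```

The surname clears the 0.85 threshold despite the missing letter, while the abbreviated address does not, which shows how the threshold decides what counts as "similar enough" for each column.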

### Finding matching pairs
The output is a multi index DataFrame, where the first index is the row index from the first DataFrame, or census A, and the second index is a list of all row indices in census B. The columns are the columns being compared, with values being 1 for a match, and 0 for not a match.

```python
print(potential_matches)
```
```
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2       
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
```

To find potential matches, we just filter for rows where the sum of row values is at or above a certain threshold, in this case greater than or equal to 2. We'll dig deeper into these matches and see how to use them to link our census DataFrames in the next lesson.
```python
potential_matches[potential_matches.sum(axis = 1) >= 2]
```
```
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2       
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3967-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
```
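The filter is a sum over each row followed by a comparison. As a pandas-free sketch, the dictionary below copies a few of the rows printed above (keyed by their `rec_id` pair, with values in the column order `date_of_birth`, `state`, `surname`, `address_1`) and keeps those whose flags add up to at least 2. The variable names are illustrative.

```python
# A few rows copied from the comparison output shown above
comparison = {
    ('rec-1070-org', 'rec-561-dup-0'): [0, 1, 0.0, 0.0],
    ('rec-1070-org', 'rec-2642-dup-0'): [0, 1, 0.0, 0.0],
    ('rec-4878-org', 'rec-4878-dup-0'): [1, 1, 1.0, 0.0],
    ('rec-417-org', 'rec-2867-dup-0'): [0, 1, 0.0, 1.0],
}

min_matching_columns = 2
likely_matches = {pair: flags for pair, flags in comparison.items()
                  if sum(flags) >= min_matching_columns}

for pair, flags in likely_matches.items():
    print(pair, sum(flags))
```

Every pair matches on `state` because the pairs were blocked on that column, so a threshold of 2 effectively asks for one more matching column on top of the state.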


## To link or not to link?
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when the data sources share no common unique key, such as a unique identifier, that you can rely on when linking them.

In this exercise, you will classify each card as either a traditional join problem or a record linkage one.

- Classify each card into a problem that requires record linkage or regular joins.

Record linkage | Regular joins
:---|:---
Two customer DataFrames containing names and addresses, one with a unique identifier per customer, one without. | Two basketball DataFrames with a common unique identifier per game.
Using an `address` column to join two DataFrames, with the address in each DataFrame being formatted slightly differently. | Consolidating two DataFrames containing details on the courses, with each course having its own unique identifier.
Merging two basketball DataFrames, with columns `team_A`, `team_B`, and `time` and differently formatted team names between each DataFrame. |

*Don't make things more complicated than they need to be: record linkage is a powerful tool, but it's more complex than using a traditional join.*
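The difference can be seen on two restaurant names that appear in this chapter's data, `'feenix'` in one dataset and `'fenix'` in the other. An exact-key lookup, which is what a regular join does, finds nothing, while a similarity-based lookup recovers the match. The sketch uses `difflib.get_close_matches` from the standard library as a stand-in for the record linkage tooling; the `0.8` cutoff is an arbitrary choice for this example.

```python
from difflib import get_close_matches

# Names taken from the restaurants data shown in this chapter
restaurant_names = ['kokomo', 'feenix', 'parkway', 'r-23', 'gumbo']
new_name = 'fenix'

# Regular join behaviour: the key must match exactly
exact_hits = [name for name in restaurant_names if name == new_name]

# Record-linkage behaviour: accept names that are similar enough
fuzzy_hits = get_close_matches(new_name, restaurant_names, n=1, cutoff=0.8)

print(exact_hits)
print(fuzzy_hits)
```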

## Pairs of restaurants
In the last lesson, you cleaned the `restaurants` dataset to make it ready for building a restaurant recommendation engine. You have a new DataFrame named `restaurants_new`, scraped from a new data source, with new restaurants to train your model on.

You've already cleaned the `cuisine_type` and `city` columns using the techniques learned throughout the course. However, you saw duplicates with typos in restaurant names, which call for record linkage with `restaurants` instead of a join.

In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between `restaurants` and `restaurants_new`.

- Instantiate an indexing object by using the `Index()` function from `recordlinkage`.
- Block your pairing on `cuisine_type` by using `indexer`'s `.block()` method.
- Generate pairs by indexing `restaurants` and `restaurants_new` in that order.

```python
import recordlinkage

restaurants_new = pd.read_csv('restaurants_new.csv')
restaurants_new.head()
```
```
Unnamed: 0.1,Unnamed: 0,name,addr,city,phone,type
0,0,arnie morton's of chicago,435 s. la cienega blv .,los angeles,3102461501,american
1,1,art's delicatessen,12224 ventura blvd.,studio city,8187621221,american
2,2,campanile,624 s. la brea ave.,los angeles,2139381447,american
3,3,fenix,8358 sunset blvd. west,hollywood,2138486677,american
4,4,grill on the alley,9560 dayton way,los angeles,3102760615,american
```
```python
# Create an indexing object to find possible pairs
indexer = recordlinkage.Index()

# Block pairing on the cuisine column ('type' in this CSV)
indexer.block('type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
```

### Question
Now that you've generated your pairs, you've achieved the first step of record linkage. What are the steps remaining to link both restaurants DataFrames, and in what order?

1. **Compare between columns, score the comparison, then link the DataFrames.**
2. ~Clean the data, compare between columns, link the DataFrames, then score the comparison.~
3. ~Clean the data, compare between columns, score the comparison, then link the DataFrames.~

**Answer: 1.** Cleaning data precedes the pair generation phase, and linking DataFrames is the final step.

## Similar restaurants
In the last exercise, you generated pairs between `restaurants` and `restaurants_new` in an effort to cleanly merge both DataFrames using record linkage.

When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.

Now that your pairs have been generated and stored in `pairs`, you will find exact matches in the `city` and `cuisine_type` columns between each pair, and similar strings for each pair in the `rest_name` column.

- Instantiate a comparison object using the `recordlinkage.Compare()` function.

```python
# Create a comparison object
comp_cl = recordlinkage.Compare()
```

- Use the appropriate `comp_cl` method to find exact matches between the `city` and `cuisine_type` columns of both DataFrames.
- Use the appropriate `comp_cl` method to find similar strings with a `0.8` similarity threshold in the `rest_name` column of both DataFrames.

```python
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city and cuisine type ('type' in this CSV)
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('type', 'type', label='type')

# Find similar matches on the restaurant name
# ('rest_name' in the exercise text, 'name' in the CSV loaded above)
comp_cl.string('name', 'name', label='name', threshold=0.8)
```
```
<Compare>
```

- Compute the comparison of the pairs by using the `.compute()` method of `comp_cl`.

```python
# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
```

output:
```
            city  cuisine_type  name
    0   0      0             1   0.0
        1      0             1   0.0
        7      0             1   0.0
        12     0             1   0.0
        13     0             1   0.0
    ...      ...           ...   ...
    40  18     0             1   0.0
    281 18     0             1   0.0
    288 18     0             1   0.0
    302 18     0             1   0.0
    308 18     0             1   0.0
    
    [3631 rows x 3 columns]
```

### Question
Print out `potential_matches`. Its columns are the columns being compared, and each value is 1 for a match and 0 for a non-match for each pair of rows in your DataFrames. To find potential matches, you need to find rows with more than one matching value across these columns. You can find them with
```python
potential_matches[potential_matches.sum(axis = 1) >= n]
```

Here `n` is the minimum number of columns that must match for a pair to count as a proper duplicate. What do you think the value of `n` should be?

1. **3 because I need to have matches in all my columns.**
2. ~2 because matching on any of the 2 columns or more is enough to find potential duplicates.~ (If `n` is set to 2, then you will get duplicates for all restaurants with the same cuisine type in the same city.)
3. ~1 because matching on just 1 column like the restaurant name is enough to find potential duplicates.~ (What if you had restaurants with the same name in different cities?)

**Answer: 1.** For this example, tightening your selection criteria will ensure good duplicate finds. In the next lesson, you'll build on what you learned to link these two DataFrames.

In [None]:
potential_matches[potential_matches.sum(axis = 1) >= 3]

output:
```
       city  cuisine_type  name
0  40     1             1   1.0
1  28     1             1   1.0
2  74     1             1   1.0
3  1      1             1   1.0
4  53     1             1   1.0
8  43     1             1   1.0
9  50     1             1   1.0
13 7      1             1   1.0
14 67     1             1   1.0
17 12     1             1   1.0
20 20     1             1   1.0
21 27     1             1   1.0
5  65     1             1   1.0
7  79     1             1   1.0
12 26     1             1   1.0
18 71     1             1   1.0
6  73     1             1   1.0
10 75     1             1   1.0
11 21     1             1   1.0
16 57     1             1   1.0
19 47     1             1   1.0
15 55     1             1   1.0
```
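
To see how the choice of `n` changes the result, here is a minimal sketch on a hand-made comparison table. The data and the names `toy_matches`, `survivors`, and `strict` are hypothetical and only mimic the shape of `potential_matches`.

```python
import pandas as pd

# Hypothetical comparison scores for four candidate pairs (toy data)
index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
toy_matches = pd.DataFrame(
    {
        "city": [1, 1, 0, 1],
        "cuisine_type": [1, 1, 0, 0],
        "name": [1.0, 0.0, 1.0, 0.0],
    },
    index=index,
)

# Count how many pairs survive for each minimum number of matching columns
survivors = {n: int((toy_matches.sum(axis=1) >= n).sum()) for n in (1, 2, 3)}
print(survivors)

# Keep only the pairs that match on all three columns
strict = toy_matches[toy_matches.sum(axis=1) >= 3]
print(strict)
```

In this toy table, lowering `n` lets in pairs that only share a city and cuisine type, or only a name, which is the concern raised by the rejected answers above.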

---

## Linking DataFrames
### Potential matches
Let's look closely at the potential matches. It is a multi-index DataFrame with two index columns, `rec_id_1` and `rec_id_2`.
```python
potential_matches
```
```
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2       
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...                                    ...    ...      ...        ...
rec-1631-org rec-1697-dup-0              0      1      0.0        0.0
             rec-4404-dup-0              0      1      0.0        0.0
             rec-3780-dup-0              0      1      0.0        0.0
...                                    ...    ...      ...        ...
```
The first index column stores indices from `census_A`. The second index column stores all possible indices from `census_B` for each row index of `census_A`. The columns of our potential matches are the columns we chose to link both DataFrames on, where the value is 1 for a match and 0 otherwise.
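
The two-level index structure can be sketched with plain pandas. The record ids below are made up and only imitate the naming pattern in the output above; `toy_pairs` is a hypothetical name.

```python
import pandas as pd

# Hypothetical record ids for two small tables
ids_a = ["rec-1-org", "rec-2-org"]
ids_b = ["rec-1-dup-0", "rec-2-dup-0", "rec-3-dup-0"]

# Pair every id from the first table with every id from the second:
# 2 x 3 = 6 candidate pairs, stored as a two-level MultiIndex
toy_pairs = pd.MultiIndex.from_product(
    [ids_a, ids_b], names=["rec_id_1", "rec_id_2"]
)

print(list(toy_pairs.names))
print(len(toy_pairs))
print(toy_pairs[0])
```

Each entry of the index is one candidate pair, so a comparison DataFrame built on it has one row of scores per pair.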

### Probable matches
The first step in linking DataFrames is to narrow the potentially matching pairs down to the ones we're pretty sure of. We saw how to do this in the previous lesson: subset the rows where the row sum reaches a certain number of matching columns, in this case 3.
```python
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
```
```
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2       
rec-2404-org rec-2404-dup-0              1      1      1.0        1.0
rec-4178-org rec-4178-dup-0              1      1      1.0        1.0
rec-1054-org rec-1054-dup-0              1      1      1.0        1.0
...                                    ...    ...      ...        ...
rec-1234-org rec-1234-dup-0              1      1      1.0        1.0
rec-1271-org rec-1271-dup-0              1      1      1.0        1.0
```
The output contains the pairs of row indices from `census_A` and `census_B` that are most likely duplicates. The next step is to extract one of the index columns and subset its associated DataFrame to filter for duplicates.

Here we choose the second index column, which represents row indices of `census_B`. We want to extract those indices and subset `census_B` on them to remove duplicates with `census_A` before appending the two DataFrames together.

### Get the indices
```python
matches.index
```
```
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org',  ...
```
We can access a DataFrame's index using the index attribute. Since this is a multi index DataFrame, it returns a multi index object containing pairs of row indices from `census_A` and `census_B` respectively. 
```python
# Get indices from census_B only
duplicate_rows = matches.index.get_level_values(1)
print(duplicate_rows)
```
```
MultiIndex(levels=[['rec-2404-dup-0', 'rec-4178-dup-0', 'rec-1054-dup-0', 'rec-4663-dup-0', 'rec-485-dup-0', 'rec-2950-dup-0', 'rec-1234-dup-0', ... 'rec-299-oduprg-0'])
```
We want to extract all `census_B` indices, so we chain `.index` with the `.get_level_values()` method, which takes the index level whose values we want to extract. We can pass either the index column's name or its position, which in this case is 1.
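
A small sketch of the two equivalent ways to select a level, using a made-up two-pair index (`toy_index` and its ids are hypothetical):

```python
import pandas as pd

# Hypothetical index of matched pairs with named levels
toy_index = pd.MultiIndex.from_tuples(
    [("rec-1-org", "rec-1-dup-0"), ("rec-2-org", "rec-2-dup-0")],
    names=["rec_id_1", "rec_id_2"],
)

# Select the second level by position and by name
by_position = toy_index.get_level_values(1)
by_name = toy_index.get_level_values("rec_id_2")

print(list(by_position))
print(by_position.equals(by_name))
```

Selecting by name tends to be easier to read back later, while selecting by position also works when the levels are unnamed.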

### Linking DataFrames
```python
# Finding duplicates in census_B
census_B_duplicates = census_B[census_B.index.isin(duplicate_rows)]

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]
```
To find the duplicates in `census_B`, we subset `census_B` on the rows whose index is among those found through record linkage. You can examine them further for similarity with their duplicates in `census_A`, but if you're sure of your analysis, you can find the non-duplicates by repeating the same line of code with a tilde (`~`) at the beginning of the subset condition.

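```python
# Link the DataFrames (pd.concat replaces DataFrame.append, removed in pandas 2.0)
full_census = pd.concat([census_A, census_B_new])
```
Now that you have your non-duplicates, all you need is to append `census_B_new` to `census_A` with `pd.concat` (the DataFrame `append` method was removed in pandas 2.0), and you have your linked data.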
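
The subsetting and linking steps can be tried end to end on toy data. Everything below (`toy_a`, `toy_b`, the flagged id `b1`) is hypothetical, and the final step uses `pd.concat`, which works across pandas versions.

```python
import pandas as pd

# Two hypothetical tables with their own row indices
toy_a = pd.DataFrame({"surname": ["smith", "jones"]}, index=["a1", "a2"])
toy_b = pd.DataFrame(
    {"surname": ["smith", "lee", "khan"]}, index=["b1", "b2", "b3"]
)

# Suppose record linkage flagged row b1 as a duplicate of a row in toy_a
toy_duplicate_rows = pd.Index(["b1"])

# Boolean mask: True where a row of toy_b was flagged as a duplicate
is_duplicate = toy_b.index.isin(toy_duplicate_rows)
toy_b_duplicates = toy_b[is_duplicate]
toy_b_new = toy_b[~is_duplicate]  # the tilde inverts the mask

# Stack the first table and the non-duplicate rows of the second
linked = pd.concat([toy_a, toy_b_new])
print(linked)
```

The linked table keeps every row of the first table and only the unflagged rows of the second.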

To recap, we built on our previous work in generating pairs, comparing across columns, and finding potential matches.

We then isolated the probable matches, the pairs with matches across 3 columns or more, tightening our search for duplicates across both DataFrames before linking them.

Next, we extracted the row indices of `census_B` that are duplicates, used the tilde symbol to find the rows of `census_B` that are not duplicated in `census_A`, and linked both DataFrames for full census results.
```python
# Import recordlinkage and generate pairs and compare across columns
. . .
# Generate potential matches
potential_matches = compare_cl.compute(full_pairs, census_A, census_B)

# Isolate matches with matching values for 3 or more columns
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get index for matching census_B rows only
duplicate_rows = matches.index.get_level_values(1)

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames (pd.concat replaces DataFrame.append, removed in pandas 2.0)
full_census = pd.concat([census_A, census_B_new])
```

## Getting the right index
Here's a DataFrame named `matches` containing potential matches between two DataFrames, `users_1` and `users_2`. Each DataFrame's row indices are stored in `uid_1` and `uid_2` respectively.
```
             first_name  address_1  address_2  marriage_status  date_of_birth
uid_1 uid_2                                                                  
0     3              1          1          1                1              0
     ...            ...         ...        ...              ...            ...
     ...            ...         ...        ...              ...            ...
1     3              1          1          1                1              0
     ...            ...         ...        ...              ...            ...
     ...            ...         ...        ...              ...            ...
```
How do you extract all values of the `uid_1` index column?

1. **`matches.index.get_level_values(0)`**
2. ~`matches.index.get_level_values(1)`~
3. **`matches.index.get_level_values('uid_1')`**

**Answer: Both 1 and 3 are correct.**

## Linking them together!
In the last lesson, you finished the bulk of the work on linking `restaurants` and `restaurants_new`. You generated the different pairs of potentially matching rows, searched for exact matches in the `cuisine_type` and `city` columns, and compared for similar strings in the `rest_name` column. You stored the DataFrame containing the scores in `potential_matches`.

Now it's finally time to link both DataFrames. You will do so by first extracting all row indices of `restaurants_new` that match across the columns mentioned above from `potential_matches`. Then you will subset `restaurants_new` on these indices and append the non-duplicate values to `restaurants`.

- Isolate instances of `potential_matches` where the row sum is above or equal to 3 by using the `.sum()` method.
- Extract the second index level from `matches`, which represents row indices of matching records from `restaurants_new`, by using the `.get_level_values()` method.
- Subset `restaurants_new` for rows that are not in `matching_indices`.
- Append `non_dup` to `restaurants`.

In [None]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants (pd.concat replaces DataFrame.append, removed in pandas 2.0)
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)

output:
```
                        rest_name                  rest_addr               city         phone cuisine_type 
    0   arnie morton's of chicago   435 s. la cienega blv .         los angeles    3102461501     american 
    1          art's delicatessen       12224 ventura blvd.         studio city    8187621221     american   
    2                   campanile       624 s. la brea ave.         los angeles    2139381447     american
    3                       fenix    8358 sunset blvd. west           hollywood    2138486677     american
    4          grill on the alley           9560 dayton way         los angeles    3102760615     american
    ..                        ...                        ...                ...           ...          ...
    76                        don        1136 westwood blvd.           westwood    3102091422      italian 
    77                      feast        1949 westwood blvd.            west la    3104750400      chinese 
    78                   mulberry        17040 ventura blvd.             encino    8189068881        pizza   
    80                    jiraffe      502 santa monica blvd       santa monica    3109176671  californian  
    81                   martha's  22nd street grill 25 22nd  st. hermosa beach    3103767786     american  
   
    [396 rows x 5 columns]
```

*Linking the DataFrames is arguably the most straightforward step of record linkage. You are now ready to get started on that recommendation engine.*

In [35]:
import re
x = "123-456-789"
# Is x a valid phone number?
print(bool(re.compile(r"\d{3}-\d{3}-\d{4}").match(x)))

False
