## Minimum edit distance

In the video exercise, you saw how minimum edit distance is used to identify how similar two strings are. As a reminder, minimum edit distance is the <u>minimum number</u> of steps needed to reach from **String A** to **String B**, with the operations available being:

- **Insertion** of a new character.
- **Deletion** of an existing character.
- **Substitution** of an existing character.
- **Transposition** of two existing consecutive characters.

_What is the minimum edit distance from `'sign'` to `'sing'`, and which operation(s) gets you there?_

1 by transposing `'g'` with `'n'`.

## The cutoff point

In this exercise, and throughout this chapter, you'll be working with the `restaurants` DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.

This version of `restaurants` has been collected from many sources, where the `cuisine_type` column is riddled with typos, and should contain only `italian`, `american` and `asian` cuisine types. There are so many unique categories that remapping them manually isn't scalable, and it's best to use string similarity instead.

Before doing so, you want to establish the cutoff point for the similarity score using the `fuzzywuzzy`'s `process.extract()` function by finding the similarity score of the most _distant_ typo of each category.

Instructions

1. Import `process` from `fuzzywuzzy`.
2. Store the unique `cuisine_types` into `unique_types`.
3. Calculate the similarity of `'asian'`, `'american'`, and `'italian'` to all possible `cuisine_types` using `process.extract()`, while returning all possible matches.
4. Take a look at the output, what do you think should be the similarity cutoff point when remapping categories?

In [1]:
# Import pandas
import pandas as pd

# Import dataframe
restaurants = pd.read_csv('restaurants.csv')

# Import process from fuzzywuzzy
from fuzzywuzzy import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))

[('asian', 100), ('asiane', 91), ('asiann', 91), ('asiian', 91), ('asiaan', 91), ('asianne', 83), ('asiat', 80), ('italiann', 72), ('italiano', 72), ('italianne', 72), ('italian', 67), ('amurican', 62), ('american', 62), ('italiaan', 62), ('italiian', 62), ('americann', 57), ('americano', 57), ('ameerican', 57), ('aamerican', 57), ('ameriican', 57), ('amerrican', 57), ('ammericann', 54), ('ameerrican', 54), ('ammereican', 54), ('america', 50), ('merican', 50), ('murican', 50), ('italien', 50), ('americen', 46), ('americin', 46), ('amerycan', 46), ('itali', 40)]
[('american', 100), ('americann', 94), ('americano', 94), ('ameerican', 94), ('aamerican', 94), ('ameriican', 94), ('amerrican', 94), ('america', 93), ('merican', 93), ('ammericann', 89), ('ameerrican', 89), ('ammereican', 89), ('amurican', 88), ('americen', 88), ('americin', 88), ('amerycan', 88), ('murican', 80), ('asian', 62), ('asiane', 57), ('asiann', 57), ('asiian', 57), ('asiaan', 57), ('italian', 53), ('asianne', 53), ('

## Remapping categories II

In the last exercise, you determined that the distance cutoff point for remapping typos of `'american'`, `'asian'`, and `'italian'` cuisine types stored in the `cuisine_type` column should be 80.

In this exercise, you're going to put it all together by finding matches with similarity scores equal to or higher than 80 by using `fuzywuzzy.process`'s `extract()` function, for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using `process.extract()`, the output is a list of tuples where each is formatted like:

```
(closest match, similarity score, index of match)
```

The `restaurants` DataFrame is in your environment, and you have access to a `categories` list containing the correct cuisine types (`'italian'`, `'asian'`, and `'american'`).

Instructions

1. Return all of the unique values in the `cuisine_type` column of `restaurants`.
2. Okay! Looks like you will need to use some string matching to correct these misspellings!
    - As a first step, create a list of `matches`, comparing `'italian'` with the restaurant types listed in the `cuisine_type` column.
3. Now you're getting somewhere! Now you can iterate through `matches` to reassign similar entries.
    - Within the `for` loop, use an `if` statement to check whether the similarity score in each `match` is greater than or equal to 80.
    - If it is, use `.loc` to select rows where `cuisine_type` in `restaurants` is _equal_ to the current match (which is the first element of `match`), and reassign them to be `'italian'`.
4. Finally, you'll adapt your code to work with every restaurant type in `categories`.
    - Using the variable `cuisine` to iterate through `categories`, embed your code from the previous step in an outer `for` loop.
    - Inspect the final result. _This has been done for you_.

In [2]:
# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())

['america' 'merican' 'amurican' 'americen' 'americann' 'asiane' 'itali'
 'asiann' 'murican' 'italien' 'italian' 'asiat' 'american' 'americano'
 'italiann' 'ameerican' 'asianne' 'italiano' 'americin' 'ammericann'
 'amerycan' 'aamerican' 'ameriican' 'italiaan' 'asiian' 'asiaan'
 'amerrican' 'ameerrican' 'ammereican' 'asian' 'italianne' 'italiian']


In [3]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Inspect the first 5 matches
print(matches[0:5])

[('italian', 100, 11), ('italian', 100, 24), ('italian', 100, 37), ('italian', 100, 43), ('italian', 100, 44)]


In [4]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Iterate through the list of matches to italian
for match in matches:
  # Check whether the similarity score is greater than or equal to 80
  if match[1] >= 80:
    # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants['cuisine_type'] == match[0]] = 'italian'

In [5]:
categories = ['italian', 'asian', 'american']

# Iterate through categories
for cuisine in categories:  
  # Create a list of matches, comparing cuisine with the cuisine_type column
  matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

  # Iterate through the list of matches
  for match in matches:
     # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
      # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
      restaurants.loc[restaurants['cuisine_type'] == match[0]] = cuisine
    
# Inspect the final result
print(restaurants['cuisine_type'].unique())

['american' 'asian' 'italian']


## To link or not to link?
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there are no common unique keys between the data sources you can rely upon when linking data sources such as a unique identifier.

In this exercise, you will classify each card whether it is a traditional join problem, or a record linkage one.

Instructions

1. Classify each card into a problem that requires record linkage or regular joins.

- Record linkage
    - Two customer DataFrames containing names and adress, one with a unique identifier per customer, one without.
    - Merging two basketballs DataFrame, with columns `team_A`, `team_B`, and `time` and differently formatted team names between each DataFrame.
    - Using an `address` column to join two DataFrames, with the address in each DataFrame being formatted slightly differently.

- Regular joins
    - Two basketball DataFrame with a common unique identifier per game.
    - Consolidating two DataFrames containing details on DataCamp courses its own unique identifier.

## Pairs of restaurants

In the last lesson, you cleaned the `restaurants` dataset to make it ready for building a restaurants recommendation engine. You have a new DataFrame named `restaurants_new` with new restaurants to train your model on, that's been scraped from a new data source.

You've already cleaned the `cuisine_type` and `city` columns using the techniques learned throughout the course. However you saw duplicates with typos in restaurants names that require record linkage instead of joins with `restaurants`.

In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between `restaurants` and `restaurants_new`. Both DataFrames, `pandas` and `recordlinkage` are in your environment.

Instructions

1. Instantiate an indexing object by using the `Index()` function from `recordlinkage`.
2. Block your pairing on `cuisine_type` by using `indexer`'s' `.block()` method.
3. Generate pairs by indexing `restaurants` and `restaurants_new` in that order.
4. Now that you've generated your pairs, you've achieved the first step of record linkage. What are the steps remaining to link both restaurants DataFrames, and in what order?

In [20]:
# Import recordlinkage
import recordlinkage

# Import dataset
restaurants_new = pd.read_csv('restaurants_new.csv')

In [21]:
# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

Compare between columns, score the comparison, then link the DataFrames.

## Similar restaurants

In the last exercise, you generated pairs between `restaurants` and `restaurants_new` in an effort to cleanly merge both DataFrames using record linkage.

When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.

Now that your pairs have been generated and stored in `pairs`, you will find exact matches in the `city` and `cuisine_type` columns between each pair, and similar strings for each pair in the `rest_name` column. Both DataFrames, `pandas` and `recordlinkage` are in your environment.

Instructions

1. Instantiate a comparison object using the `recordlinkage.Compare()` function.
2. Use the appropriate `comp_cl` method to find exact matches between the `city` and `cuisine_type` columns of both DataFrames.
3. Use the appropriate `comp_cl` method to find similar strings with a `0.8` similarity threshold in the `rest_name` column of both DataFrames.
4. Compute the comparison of the pairs by using the `.compute()` method of `comp_cl`.
5. Print out `potential_matches`, the columns are the columns being compared, with values being 1 for a match, and 0 for not a match for each pair of rows in your DataFrames. To find potential matches, you need to find rows with more than matching value in a column. You can find them with `potential_matches[potential_matches.sum(axis = 1) >= n]`. Where `n` is the minimum number of columns you want matching to ensure a proper duplicate find, what do you think should the value of `n` be?

In [23]:
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city, cuisine_types 
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('name', 'name', label='name', threshold = 0.8)

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new) 
print(potential_matches)

       city  cuisine_type  name
0   0     0             1   0.0
1   0     0             1   0.0
2   0     0             1   0.0
3   0     0             1   0.0
4   0     0             1   0.0
...     ...           ...   ...
323 0     0             1   0.0
324 0     0             1   0.0
325 0     0             1   0.0
327 0     0             1   0.0
334 0     0             1   0.0

[153 rows x 3 columns]


3 because I need to have matches in all my columns.

## Getting the right index

Here's a DataFrame named `matches` containing potential matches between two DataFrames, `users_1` and `users_2`. Each DataFrame's row indices is stored in `uid_1` and `uid_2` respectively.

```
             first_name  address_1  address_2  marriage_status  date_of_birth
uid_1 uid_2                                                                  
0     3              1          1          1                1              0
     ...            ...         ...        ...              ...            ...
     ...            ...         ...        ...              ...            ...
1     3              1          1          1                1              0
     ...            ...         ...        ...              ...            ...
     ...            ...         ...        ...              ...            ...
```

How do you extract all values of the `uid_1` index column?

`matches.index.get_level_values(0)` or `matches.index.get_level_values('uid_1')`

## Linking them together!

In the last lesson, you've finished the bulk of the work on your effort to link `restaurants` and `restaurants_new`. You've generated the different pairs of potentially matching rows, searched for exact matches between the `cuisine_type` and `city` columns, but compared for similar strings in the `rest_name` column. You stored the DataFrame containing the scores in `potential_matches`.

Now it's finally time to link both DataFrames. You will do so by first extracting all row indices of `restaurants_new` that are matching across the columns mentioned above from `potential_matches`. Then you will subset `restaurants_new` on these indices, then append the non-duplicate values to `restaurants`. All DataFrames are in your environment, alongside `pandas` imported as `pd`.

Instructions

1. Isolate instances of `potential_matches` where the row sum is above or equal to 3 by using the `.sum()` method.
2. Extract the second column index from `matches`, which represents row indices of matching record from `restaurants_new` by using the `.get_level_values()` method.
3. Subset `restaurants_new` for rows that are not in `matching_indices`.
4. Append `non_dup` to `restaurants`.

In [24]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
full_restaurants

Unnamed: 0.1,Unnamed: 0,name,addr,city,phone,cuisine_type
0,american,american,american,american,american,american
1,american,american,american,american,american,american
2,american,american,american,american,american,american
3,american,american,american,american,american,american
4,american,american,american,american,american,american
...,...,...,...,...,...,...
77,77,feast,1949 westwood blvd.,west la,3104750400,chinese
78,78,mulberry,17040 ventura blvd.,encino,8189068881,pizza
79,79,matsuhissa,129 n. la cienega blvd.,beverly hills,3106599639,asian
80,80,jiraffe,502 santa monica blvd,santa monica,3109176671,californian
