minimum edit distance is the minimum number of steps needed to reach from String A to String B, with the operations available being:

Insertion of a new character.
Deletion of an existing character.
Substitution of an existing character.
Transposition of two existing consecutive characters.

    What is the minimum edit distance from 'sign' to 'sing', and which operation(s) gets you there?
    - 1 by transposing 'g' with 'n'.

- Import process from fuzzywuzzy.
- Store the unique cuisine_types into unique_types.
- Calculate the similarity of 'asian', 'american', and 'italian' to all possible cuisine_types using process.extract(), while returning all possible matches.

In [12]:
import pandas as pd
restaurants = pd.read_csv("restaurants_L2.csv")
# Import process from fuzzywuzzy
from fuzzywuzzy import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))

[('asian', 100), ('italian', 67), ('american', 62), ('mexican', 50), ('cajun', 40), ('southwestern', 36), ('southern', 31), ('coffeebar', 26), ('steakhouses', 25)]
[('american', 100), ('mexican', 80), ('cajun', 68), ('asian', 62), ('italian', 53), ('southwestern', 41), ('southern', 38), ('coffeebar', 24), ('steakhouses', 21)]
[('italian', 100), ('asian', 67), ('mexican', 43), ('american', 40), ('cajun', 33), ('southern', 27), ('southwestern', 26), ('steakhouses', 26), ('coffeebar', 12)]


In [11]:
restaurants.columns

Index(['Unnamed: 0', 'name', 'addr', 'city', 'phone', 'type'], dtype='object')

Take a look at the output, what do you think should be the similarity cutoff point when remapping categories?
- 80

- Return all of the unique values in the cuisine_type column of restaurants.

In [13]:
# Inspect the unique values of the cuisine_type column
print(restaurants['type'].unique())

['american' 'asian' 'italian' 'coffeebar' 'mexican' 'southwestern'
 'steakhouses' 'southern' 'cajun']


Okay! Looks like you will need to use some string matching to correct these misspellings!

- As a first step, create a list of all possible matches, comparing 'italian' with the restaurant types listed in the cuisine_type column.

In [14]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['type'], limit = len(restaurants['type']))

# Inspect the first 5 matches
print(matches[0:5])

[('italian', 100, 6), ('italian', 100, 10), ('italian', 100, 11), ('italian', 100, 16), ('italian', 100, 19)]


Now you're getting somewhere! Now you can iterate through matches to reassign similar entries.

- Within the for loop, use an if statement to check whether the similarity score in each match is greater than or equal to 80.
- If it is, use .loc to select rows where cuisine_type in restaurants is equal to the current match (which is the first element of match), and reassign them to be 'italian'.

In [16]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['type'], limit=len(restaurants.type))

# Iterate through the list of matches to italian
for match in matches:
  # Check whether the similarity score is greater than or equal to 80
  if match[1] >= 80:
    # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants.loc[:,'type'] == match[0]] = 'italian'

Finally, you'll adapt your code to work with every restaurant type in categories.

- Using the variable cuisine to iterate through categories, embed your code from the previous step in an outer for loop.
- Inspect the final result. This has been done for you.

In [19]:
categories = ['italian', 'asian', 'american']
# Iterate through categories
for cuisine in categories:  
  # Create a list of matches, comparing cuisine with the cuisine_type column
  matches = process.extract(cuisine, restaurants['type'], limit=len(restaurants.type))

  # Iterate through the list of matches
  for match in matches:
     # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
      # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
      restaurants.loc[restaurants['type'] == match[0]] = cuisine
      
# Inspect the final result
print(restaurants['type'].unique())

['american' 'asian' 'italian' 'coffeebar' 'southwestern' 'steakhouses'
 'southern' 'cajun']


- Instantiate an indexing object by using the Index() function from recordlinkage.
- Block your pairing on cuisine_type by using indexer's' .block() method.
- Generate pairs by indexing restaurants and restaurants_new in that order.

In [26]:
import recordlinkage
restaurants_new = pd.read_csv("restaurants_L2_dirty.csv")

# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

In [22]:
# conda install recordlinkage

Now that you've generated your pairs, you've achieved the first step of record linkage. What are the steps remaining to link both restaurants DataFrames, and in what order?
- Compare between columns, score the comparison, then link the DataFrames.

- Instantiate a comparison object using the recordlinkage.Compare() function.
- Use the appropriate comp_cl method to find exact matches between the city and cuisine_type columns of both DataFrames.
- Use the appropriate comp_cl method to find similar strings with a 0.8 similarity threshold in the rest_name column of both DataFrames.
- Compute the comparison of the pairs by using the .compute() method of comp_cl.

In [30]:
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city, cuisine_types - 
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('type', 'type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('name', 'name', label='name', threshold = 0.8) 

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)

        city  cuisine_type  name
0   0      0             1   0.0
    1      0             1   0.0
    7      0             1   0.0
    12     0             1   0.0
    13     0             1   0.0
...      ...           ...   ...
40  18     0             1   0.0
281 18     0             1   0.0
288 18     0             1   0.0
302 18     0             1   0.0
308 18     0             1   0.0

[3784 rows x 3 columns]


`potential_matches[potential_matches.sum(axis = 1) >= n]`
Where n is the minimum number of columns you want matching to ensure a proper duplicate find, what do you think should the value of n be?
- 3 because I need to have matches in all my columns.

In [32]:
potential_matches[potential_matches.sum(axis = 1) >= 3]

Unnamed: 0,Unnamed: 1,city,cuisine_type,name
15,55,1,1,1.0


Here's a DataFrame named matches containing potential matches between two DataFrames, users_1 and users_2. Each DataFrame's row indices is stored in uid_1 and uid_2 respectively. They are together in a multi-index format.
How do you extract all values of the uid_1 index column?
- `matches.index.get_level_values(0)`
- `matches.index.get_level_values('uid_1')`

- Isolate instances of potential_matches where the row sum is above or equal to 3 by using the .sum() method.
- Extract the second column index from matches, which represents row indices of matching record from restaurants_new by using the .get_level_values() method.
- Subset restaurants_new for rows that are not in matching_indices.
- Append non_dup to restaurants.

In [33]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)

   Unnamed: 0        name                       addr               city  \
0    american    american                   american           american   
1    american    american                   american           american   
2    american    american                   american           american   
3    american    american                   american           american   
4    american    american                   american           american   
..        ...         ...                        ...                ...   
77         77       feast        1949 westwood blvd.            west la   
78         78    mulberry        17040 ventura blvd.             encino   
79         79  matsuhissa   129 n. la cienega blvd.       beverly hills   
80         80     jiraffe      502 santa monica blvd       santa monica   
81         81    martha's  22nd street grill 25 22nd  st. hermosa beach   

         phone         type  
0     american     american  
1     american     american  
2     ame