Inspired by the FastLCS post from a couple of days ago, I found faster implementations of SequenceMatcher and Levenshtein as well. Efficiency is evaluated in a way similar to the FastLCS notebook.

https://www.kaggle.com/code/chack3/fastlcs

Much of the code in this notebook is copied from the amazing notebook above.


The new SequenceMatcher package (cdifflib) reduced the run time on the name pairs from 19.7 seconds to 13 seconds.

The new Levenshtein package (polyleven) reduced the run time on the same pairs from 375 ms to 234 ms.
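
To put the timings above on one scale, here is a minimal sketch that turns them into speedup factors. The helper name `speedup` is my own for illustration; the numbers are the ones quoted above, converted to seconds.

```python
def speedup(before_s, after_s):
    """Return how many times faster the new run is than the old one."""
    return before_s / after_s

# Timings quoted above, in seconds.
seq_speedup = speedup(19.7, 13.0)    # SequenceMatcher -> CSequenceMatcher
lev_speedup = speedup(0.375, 0.234)  # Levenshtein.distance -> polyleven

print(f"SequenceMatcher: {seq_speedup:.2f}x")
print(f"Levenshtein:     {lev_speedup:.2f}x")
```

Both changes come out at roughly 1.5x to 1.6x on this dataset; other data and hardware may give different ratios.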

### 1. Install & import the required packages

In [None]:
!pip install cdifflib polyleven

In [None]:
from polyleven import levenshtein
from cdifflib import CSequenceMatcher

import Levenshtein
from difflib import SequenceMatcher

import pandas as pd
import numpy as np

### 2. First, evaluate on a single pair of strings

In [None]:
string1 = "This string is to test speed of different implementation"
string2 = "This other string is to measure speed of different implementation"

In [None]:
%%timeit

SequenceMatcher(None, string1, string2).ratio()

In [None]:
%%timeit

CSequenceMatcher(None, string1, string2).ratio()

In [None]:
%%timeit

Levenshtein.distance(string1, string2)

In [None]:
%%timeit

levenshtein(string1, string2)
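
`%%timeit` is an IPython magic, so the cells above only run inside a notebook. As a rough sketch of the same measurement in a plain script, the standard-library `timeit` module can be used. The helper `best_time_per_call` is a hypothetical name, and only the standard `difflib` version is timed here so the snippet runs without extra packages; swapping in `CSequenceMatcher` or `levenshtein` is assumed to work the same way. Note that this reports the best run, not the mean ± std. dev. that `%%timeit` prints.

```python
import timeit
from difflib import SequenceMatcher

string1 = "This string is to test speed of different implementation"
string2 = "This other string is to measure speed of different implementation"

def best_time_per_call(func, number=200, repeat=3):
    """Return the best observed time per call, in seconds."""
    # timeit.repeat returns one total time per repeat, each covering `number` calls.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number

def difflib_ratio():
    # Same call as in the %%timeit cell for the original SequenceMatcher.
    return SequenceMatcher(None, string1, string2).ratio()

per_call = best_time_per_call(difflib_ratio)
print(f"difflib SequenceMatcher: {per_call * 1e6:.1f} µs per call")
```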

### 3. Evaluate on the name columns (578907 pairs) in pairs.csv

In [None]:
pairs_df = pd.read_csv('../input/foursquare-location-matching/pairs.csv')
pairs_df = pairs_df.fillna("")
print(len(pairs_df),"pairs")

# Accessing DataFrame values inside a loop is slow, so copy the name columns into tuples up front.
names_1 = tuple(pairs_df["name_1"])
names_2 = tuple(pairs_df["name_2"])

In [None]:
%%timeit
# Original SequenceMatcher (difflib)
# 19.7 s ± 59.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
for name_1,name_2 in zip(names_1,names_2):
    d1 = SequenceMatcher(None, name_1, name_2).ratio()

In [None]:
%%timeit
# C implementation of SequenceMatcher (cdifflib)
# 13 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
for name_1,name_2 in zip(names_1,names_2):
    d1 = CSequenceMatcher(None, name_1,name_2).ratio()

In [None]:
%%timeit
# Original Levenshtein
# 375 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
for name_1,name_2 in zip(names_1,names_2):
    d1 = Levenshtein.distance(name_1,name_2)

In [None]:
%%timeit
# New levenshtein (polyleven)
# 234 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
for name_1,name_2 in zip(names_1,names_2):
    d1 = levenshtein(name_1,name_2)

### 4. Check if all values match

In [None]:
# Check that SequenceMatcher and CSequenceMatcher return the same ratio for every pair
for name_1,name_2 in zip(names_1,names_2):
    d1 = SequenceMatcher(None, name_1, name_2).ratio()
    d2 = CSequenceMatcher(None, name_1,name_2).ratio()
    if d1!=d2:
        print("ng")
        break
else:
    print("ok")

In [None]:
# Check that Levenshtein.distance and polyleven's levenshtein return the same distance for every pair
for name_1,name_2 in zip(names_1,names_2):
    d1 = Levenshtein.distance(name_1,name_2)
    d2 = levenshtein(name_1,name_2)
    if d1!=d2:
        print("ng")
        break
else:
    print("ok")
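
The checks above only print "ng" on a mismatch, without showing which pair differed. A small sketch of a more informative variant follows. `first_mismatch` and `reference_levenshtein` are hypothetical helpers written for this example, not part of either library; the pure-Python distance stands in for the real packages so the snippet runs on its own, and the "buggy" function exists only to trigger a mismatch.

```python
def reference_levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def first_mismatch(func_a, func_b, pairs):
    """Return (s1, s2, value_a, value_b) for the first disagreement, or None."""
    for s1, s2 in pairs:
        va, vb = func_a(s1, s2), func_b(s1, s2)
        if va != vb:
            return (s1, s2, va, vb)
    return None

def buggy_distance(a, b):
    # Deliberately wrong: only looks at the length difference.
    return abs(len(a) - len(b))

sample_pairs = [("abc", "abc"), ("abc", "abd"), ("kitten", "sitting")]
print(first_mismatch(reference_levenshtein, reference_levenshtein, sample_pairs))
print(first_mismatch(reference_levenshtein, buggy_distance, sample_pairs))
```

In the notebook, the same helper could be called with `Levenshtein.distance` and `levenshtein` over `zip(names_1, names_2)`, assuming both accept two strings as they do in the cells above.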