<a href="https://colab.research.google.com/github/whylucify1/ABC-Fuzzy-string/blob/main/Fuzzy_string.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Eight fuzzy string matching techniques: code and analysis

In [None]:
# Jaccard similarity:
def jaccard_similarity(str1, str2):
    # Compare the sets of whitespace-separated words; word order and word frequency are ignored.
    set1 = set(str1.split())
    set2 = set(str2.split())
    union = set1.union(set2)
    if not union:
        return 1.0  # Assumption: two empty strings are treated as identical.
    return len(set1.intersection(set2)) / len(union)

## Jaccard similarity:
* Definition of Jaccard similarity: The Jaccard similarity measures how similar two texts or documents are by dividing the size of the intersection of their element sets by the size of their union. Its value lies between 0 and 1; the closer the value is to 1, the more similar the two sets are.
* Pros and cons of Jaccard similarity:
  * Pros:
    1. It is easy to calculate and to implement in Python.
    2. It is effective for comparing sets of binary values or presence/absence data.
    3. Robust to imbalance in set sizes.
  * Cons:
    1. It is not sensitive to the order or frequency of elements in the sets. (With the word-level implementation above, neither the sequence of the words nor how often each word occurs affects the score.)
    2. It may not perform well on numerical datasets.
    3. It can produce misleading results with very sparse sets.
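
The first con can be seen directly in a small experiment. The sketch below is only illustrative: `word_jaccard` is a hypothetical helper that mirrors the word-level approach above, and treating two empty strings as identical is an assumption.

```python
def word_jaccard(str1, str2):
    # Hypothetical helper: Jaccard similarity over whitespace-separated words.
    set1, set2 = set(str1.split()), set(str2.split())
    union = set1 | set2
    # Assumption: two empty inputs count as identical.
    return len(set1 & set2) / len(union) if union else 1.0

# Same words in a different order: the score cannot tell them apart.
same_words_reordered = word_jaccard("the cat sat", "sat the cat")
# Repeating a word does not change the set, so frequency is ignored.
repeated_word = word_jaccard("spam spam spam", "spam")
# Two shared words out of five distinct words overall.
partial_overlap = word_jaccard("the cat sat on the mat", "the cat")

print(same_words_reordered, repeated_word, partial_overlap)  # 1.0 1.0 0.4
```

If order or frequency matters for the task, one of the edit-distance or vector-based measures is likely a better fit.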


In [None]:
# Levenshtein distance:
def levenshtein_distance(str1, str2):
    m = len(str1)
    n = len(str2)
    dp = [[0 for x in range(n + 1)] for x in range(m + 1)]

    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1])

    return dp[m][n]

In [None]:
# Hamming distance:
def hamming_distance(str1, str2):
    if len(str1) != len(str2):
        raise ValueError("Input strings must have the same length.")
    return sum(el1 != el2 for el1, el2 in zip(str1, str2))

In [None]:
# Damerau-Levenshtein distance (optimal string alignment variant: adjacent transpositions only):
def damerau_levenshtein_distance(str1, str2):
    d = {}
    lenstr1 = len(str1)
    lenstr2 = len(str2)
    for i in range(-1, lenstr1 + 1):
        d[(i, -1)] = i + 1
    for j in range(-1, lenstr2 + 1):
        d[(-1, j)] = j + 1

    for i in range(lenstr1):
        for j in range(lenstr2):
            if str1[i] == str2[j]:
                cost = 0
            else:
                cost = 1
            d[(i, j)] = min(
                d[(i - 1, j)] + 1,
                d[(i, j - 1)] + 1,
                d[(i - 1, j - 1)] + cost,
            )
            if i and j and str1[i] == str2[j - 1] and str1[i - 1] == str2[j]:
                d[(i, j)] = min(d[(i, j)], d[i - 2, j - 2] + cost)

    return d[lenstr1 - 1, lenstr2 - 1]

In [None]:
# Longest common subsequence (LCS):
def longest_common_subsequence(str1, str2):
    m = len(str1)
    n = len(str2)
    dp = [[0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    index = dp[m][n]
    lcs = [""] * (index + 1)
    lcs[index] = ""
    i = m
    j = n
    while i > 0 and j > 0:
        if str1[i - 1] == str2[j - 1]:
            lcs[index - 1] = str1[i - 1]
            i -= 1
            j -= 1
            index -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(lcs)

In [None]:
# Cosine similarity:
import math

def cosine_similarity(vec1, vec2):
    dot_product = sum(x * y for x, y in zip(vec1, vec2))
    magnitude_vec1 = math.sqrt(sum(x ** 2 for x in vec1))
    magnitude_vec2 = math.sqrt(sum(x ** 2 for x in vec2))
    if magnitude_vec1 == 0 or magnitude_vec2 == 0:
        return 0.0  # Assumption: similarity with an all-zero vector is defined as 0.
    return dot_product / (magnitude_vec1 * magnitude_vec2)

This code assumes that vec1 and vec2 are lists of numbers of equal length representing the vectors to compare (zip silently truncates to the shorter list otherwise). The result is a float between -1 and 1.

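What does cosine similarity represent: it is the cosine of the angle between the two vectors. A value of 1 means they point in the same direction, 0 means they are orthogonal (no shared components), and -1 means they point in opposite directions. The magnitudes of the vectors do not affect the score.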
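
To compare strings with cosine similarity, the strings first have to be turned into vectors. One common choice is word counts over a shared vocabulary. The sketch below is a minimal illustration under that assumption; `text_to_vectors` and `cosine` are hypothetical names, and returning 0 for an all-zero vector is also an assumption.

```python
import math
from collections import Counter

def text_to_vectors(str1, str2):
    # Hypothetical helper: count the words in each string and align the
    # counts over a shared, sorted vocabulary so both vectors have equal length.
    counts1 = Counter(str1.split())
    counts2 = Counter(str2.split())
    vocabulary = sorted(set(counts1) | set(counts2))
    vec1 = [counts1[word] for word in vocabulary]
    vec2 = [counts2[word] for word in vocabulary]
    return vec1, vec2

def cosine(vec1, vec2):
    dot = sum(x * y for x, y in zip(vec1, vec2))
    norm = math.sqrt(sum(x * x for x in vec1)) * math.sqrt(sum(y * y for y in vec2))
    # Assumption: similarity with an all-zero vector is defined as 0.
    return dot / norm if norm else 0.0

# Vocabulary is ["cat", "ran", "sat", "the"], so the vectors are
# [1, 0, 1, 1] and [1, 1, 0, 1]: two shared words out of three each.
vec1, vec2 = text_to_vectors("the cat sat", "the cat ran")
score = cosine(vec1, vec2)
print(round(score, 4))  # 0.6667
```

Because word counts are never negative, scores built this way fall between 0 and 1 rather than using the full -1 to 1 range.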

In [None]:
# Smith-Waterman Algorithm:
def smith_waterman(str1, str2, match=2, mismatch=-1, gap=-1):
    m = len(str1)
    n = len(str2)
    dp = [[0 for j in range(n + 1)] for i in range(m + 1)]
    max_score = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = max(
                0,
                dp[i - 1][j - 1] + (match if str1[i - 1] == str2[j - 1] else mismatch),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
            dp[i][j] = score
            max_score = max(max_score, score)
    return max_score

In [None]:
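# Ratcliff/Obershelp algorithm:
def longest_common_substring(A, B):
    # Return (start in A, start in B, length) of the longest common substring.
    best = (0, 0, 0)
    prev = [0] * (len(B) + 1)
    for i in range(1, len(A) + 1):
        curr = [0] * (len(B) + 1)
        for j in range(1, len(B) + 1):
            if A[i - 1] == B[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best

def matching_characters(A, B):
    # Count the longest common substring plus, recursively,
    # the matches to its left and to its right.
    if not A or not B:
        return 0
    a, b, size = longest_common_substring(A, B)
    if size == 0:
        return 0
    return (
        size
        + matching_characters(A[:a], B[:b])
        + matching_characters(A[a + size:], B[b + size:])
    )

def ratcliff_obershelp(A, B):
    # Similarity in [0, 1]: twice the matched characters over the combined length.
    if not A and not B:
        return 1.0  # Assumption: two empty strings are treated as identical.
    return 2 * matching_characters(A, B) / (len(A) + len(B))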