# Text Similarity - Levenshtein Distance

For the purposes of this tutorial, I will show you the Python code needed to implement Levenshtein distance, but the actual text similarity pipeline uses the python-Levenshtein package. You can look through the installation guidelines [here](https://pypi.org/project/python-Levenshtein/) or run the following command:   
```console
pip install python-Levenshtein  
```  

### Introduction to Text Similarity
Identifying similarity between texts is a common problem in NLP and is used by many companies worldwide. The most common application of text similarity is identifying plagiarized text. Educational institutions around the world, from elementary schools to universities, use services like Turnitin to ensure that the work students submit is original and their own. Text similarity is also commonly used by companies with a structure similar to Stack Overflow or Stack Exchange. They want to identify and flag duplicate questions so that the user posting the question can be referred to the original post with the solution. This increases the proportion of unique questions asked on their platform.  

Text similarity can be broken down into two components: semantic similarity and lexical similarity. Given a pair of texts, semantic similarity refers to how close the documents are in meaning, whereas lexical similarity is a measure of overlap in vocabulary. If both documents in the pair have the same vocabulary, they have a lexical similarity of 1; if there is no overlap in vocabulary, the lexical similarity is 0 [2].    
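
As a rough illustration of vocabulary overlap, the sketch below scores two texts with the Jaccard index of their word sets. The Jaccard index is an assumption made for this example (it is only one of several ways to quantify overlap), and `lexical_similarity` and its simple regex tokenizer are hypothetical names, not part of any library.

```python
import re


def lexical_similarity(text_a: str, text_b: str) -> float:
    """Jaccard overlap of the two vocabularies (an assumed measure)."""
    # Lowercase and split into word tokens; a deliberately simple tokenizer.
    vocab_a = set(re.findall(r"[a-z0-9']+", text_a.lower()))
    vocab_b = set(re.findall(r"[a-z0-9']+", text_b.lower()))
    if not vocab_a and not vocab_b:
        return 1.0  # two empty texts share the same (empty) vocabulary
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


print(lexical_similarity("the cat sat", "sat the cat"))  # 1.0: same vocabulary
print(lexical_similarity("the cat sat", "a dog ran"))    # 0.0: no shared words
```

Note that word order plays no role here: two texts built from the same words score 1 even if they are arranged differently.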

Achieving true semantic similarity is a very difficult, unsolved task in both NLP and mathematics. It is a heavily researched area, and many of the proposed solutions involve a certain degree of lexical similarity. In this article I will not dive deeper into semantic similarity and will focus on lexical similarity instead.  

### Levenshtein Distance  
There are many ways to identify the lexical similarity between a pair of texts; the one covered in this article is Levenshtein distance, an algorithm invented in 1965 by Vladimir Levenshtein, a Soviet mathematician [1].  
### Intuition  
Levenshtein distance is very useful because it does not require two strings to be of equal length for them to be compared. Intuitively speaking, Levenshtein distance is quite easy to understand.  

>Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. [1]  
>- https://en.wikipedia.org/wiki/Levenshtein_distance  

In other words, the output distance is the total count of the fewest single-character edits needed. A larger distance means that more changes were necessary to make the two words equal, and a smaller distance means that fewer were necessary. For example, for the pair of words `dream` and `dream` the Levenshtein distance is 0 because the two words are identical. For `dream` and `steam`, however, the Levenshtein distance is 2, because you need 2 edits to change `dr` to `st`.
Thus a large Levenshtein distance implies that the two documents are not similar, and a small distance implies that they are similar.

## Implement Levenshtein Distance

The Python code below implements Levenshtein distance using dynamic programming. The same algorithm can also be written as a brute-force recursive solution or as an iterative solution (be aware that the brute-force solution is not optimal in terms of time complexity, since it recomputes the same subproblems many times).  
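
For reference, here is a sketch of the recurrence that the memoized `min_dist` function in this section evaluates. The symbols are chosen here for illustration: $d(i, j)$ denotes the distance between the suffixes $a[i:]$ and $b[j:]$, using 0-based indices, and $|a|$ is the length of $a$.

$$
d(i, j) =
\begin{cases}
(|a| - i) + (|b| - j) & \text{if } i = |a| \text{ or } j = |b| \\
d(i + 1, j + 1) & \text{if } a_i = b_j \\
1 + \min \{\, d(i, j + 1),\ d(i + 1, j),\ d(i + 1, j + 1) \,\} & \text{otherwise}
\end{cases}
$$

The distance between the full strings is $d(0, 0)$. Because memoization solves each pair $(i, j)$ only once, the running time is $O(|a| \cdot |b|)$.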

In [1]:
from functools import lru_cache

In [2]:
def lev_dist(a, b):
    '''
    Calculate the Levenshtein distance between two input strings a and b.

    params:
        a (String) : The first string you want to compare
        b (String) : The second string you want to compare

    returns:
        The distance (an int) between string a and string b.

    example:
        a = 'stamp'
        b = 'stomp'
        lev_dist(a, b)
        >> 1
    '''

    @lru_cache(maxsize=None)  # memoization: each (s1, s2) pair is solved once
    def min_dist(s1, s2):
        # s1 and s2 are indices into a and b; the result is the distance
        # between the suffixes a[s1:] and b[s2:]

        # one string is exhausted: the rest of the other is inserted or deleted
        if s1 == len(a) or s2 == len(b):
            return len(a) - s1 + len(b) - s2

        # no change required
        if a[s1] == b[s2]:
            return min_dist(s1 + 1, s2 + 1)

        return 1 + min(
            min_dist(s1, s2 + 1),      # insert character
            min_dist(s1 + 1, s2),      # delete character
            min_dist(s1 + 1, s2 + 1),  # replace character
        )

    return min_dist(0, 0)

In [3]:
sq1 = 'saturday'
sq2 = 'sunday'
lev_dist(sq1, sq2)

3

## Text Similarity

In [4]:
import re
import numpy as np
import Levenshtein as lev

### Problem Statement 
Similar to software like Turnitin, we want to build a pipeline that identifies whether an input article is plagiarized.   

### Solution Architecture
To solve this problem we make one large initial assumption: we have a large corpus of labelled documents to cross-reference against the user's input document. We first clean the input document of noise such as stopwords and punctuation, so that the Levenshtein distance is calculated on less text and reflects the content rather than the filler. We then compare the cleaned document against each document in our corpus that has the same tag as the input document and check whether any of them is very similar to the submitted document.   
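
The architecture described above might be sketched roughly as follows. Everything here is illustrative: the corpus entries, the tiny `STOPWORDS` set, the function names (`clean_text`, `edit_distance`, `similarity`, `find_similar`), the length-based normalization, and the `0.8` threshold are assumptions, not values from the actual pipeline. A plain-Python `edit_distance` keeps the sketch dependency-free; in the real pipeline the `distance` function from python-Levenshtein would presumably take its place.

```python
import string

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"}


def clean_text(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def edit_distance(a, b):
    """Plain-Python Levenshtein distance that keeps only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def find_similar(doc, tag, corpus, threshold=0.8):
    """Return (title, score) pairs for same-tag documents scoring >= threshold."""
    cleaned = clean_text(doc)
    hits = []
    for entry in corpus:
        if entry["tag"] != tag:
            continue  # only compare against documents with the same tag
        score = similarity(cleaned, clean_text(entry["text"]))
        if score >= threshold:
            hits.append((entry["title"], round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up corpus, for illustration only.
corpus = [
    {"title": "doc-1", "tag": "nlp", "text": "Levenshtein distance counts single-character edits."},
    {"title": "doc-2", "tag": "nlp", "text": "Neural networks learn from labelled examples."},
    {"title": "doc-3", "tag": "cooking", "text": "Levenshtein distance counts single-character edits."},
]
print(find_similar("The Levenshtein distance counts single character edits!", "nlp", corpus))
```

Normalizing by the longer string's length is one possible design choice: it makes a single threshold usable for documents of different sizes, which a raw edit count would not allow.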

## Fetch Data

## Clean Data

## Find Similarity

## Concluding Remarks
Levenshtein distance is a lexical similarity measure that quantifies the distance between a pair of strings. It does so by counting the minimum number of times you would have to insert, delete or substitute a character in string 1 to turn it into string 2. A larger distance implies that the strings are less similar to each other, and vice versa.  
I created this pipeline so that it is easy to integrate with other text similarity measures. Levenshtein distance is a great measure for identifying lexical similarity between a pair of texts, but that does not mean there aren't other well-performing similarity measures. The Jaro-Winkler score in particular comes to mind and can be easily implemented in this pipeline. Be aware that the Jaro similarity is interpreted differently from the Levenshtein distance: it is a score between 0 and 1, where a higher value means the strings are more similar.  
You can follow along with this pipeline in the Jupyter Notebook I created for this project. You can find the notebook on my GitHub page [here](https://github.com/vatsal220/medium_articles/blob/main/levenshtein_distance/lev_dist.ipynb).  
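
To make the difference in interpretation concrete, here is a minimal sketch of the plain Jaro similarity, without the Winkler prefix bonus. It follows the commonly stated definition (matches within a sliding window, then half the number of out-of-order matches counted as transpositions); the function name `jaro` is chosen for this example and the code is not taken from the pipeline.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; higher means more similar."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Compare matched characters in order; each mismatch is half a transposition.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(jaro("dream", "dream"))            # 1.0 for identical strings
print(round(jaro("dream", "steam"), 3))  # below 1.0 for differing strings
```

Identical strings score 1 here, whereas their Levenshtein distance is 0, so any threshold in the pipeline has to be flipped when swapping one measure for the other.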

## Resources
- [1] https://en.wikipedia.org/wiki/Levenshtein_distance
- [2] https://en.wikipedia.org/wiki/Lexical_similarity
- [3] https://pypi.org/project/python-Levenshtein/
---

In [29]:
import numpy as np

def levenshtein(seq1, seq2):
    """Bottom-up Levenshtein distance; also prints the full DP matrix."""
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y]
    matrix = np.zeros((size_x, size_y))
    # first column and first row: distance to the empty string
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # the diagonal step is free when the characters already match
            cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,       # deletion
                matrix[x-1, y-1] + cost,  # match or substitution
                matrix[x, y-1] + 1        # insertion
            )
    print(matrix)
    return matrix[size_x - 1, size_y - 1]

In [39]:
a = 'stamp'
b = 'stomp'
levenshtein(a,b)

[[0. 1. 2. 3. 4. 5.]
 [1. 0. 1. 2. 3. 4.]
 [2. 1. 0. 1. 2. 3.]
 [3. 2. 1. 1. 2. 3.]
 [4. 3. 2. 2. 1. 2.]
 [5. 4. 3. 3. 2. 1.]]


1.0
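
The bottom-right cell of the matrix holds the distance, but the table can also be read backwards to recover one possible set of edits. The sketch below is an illustration under stated assumptions: `edit_script` is a hypothetical helper, it rebuilds the same table with plain lists instead of NumPy so that it has no dependencies, and when several minimal edit scripts exist it simply returns the first one it finds.

```python
def edit_script(seq1, seq2):
    """Return one minimal list of edits that turns seq1 into seq2."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        matrix[x][0] = x
    for y in range(cols):
        matrix[0][y] = y
    for x in range(1, rows):
        for y in range(1, cols):
            cost = 0 if seq1[x - 1] == seq2[y - 1] else 1
            matrix[x][y] = min(matrix[x - 1][y] + 1,
                               matrix[x][y - 1] + 1,
                               matrix[x - 1][y - 1] + cost)

    # Walk back from the bottom-right cell, each time moving to a neighbour
    # that explains the current value.
    ops = []
    x, y = rows - 1, cols - 1
    while x > 0 or y > 0:
        same = x > 0 and y > 0 and seq1[x - 1] == seq2[y - 1]
        if same and matrix[x][y] == matrix[x - 1][y - 1]:
            x, y = x - 1, y - 1  # characters match, no edit needed
        elif x > 0 and y > 0 and matrix[x][y] == matrix[x - 1][y - 1] + 1:
            ops.append(f"substitute {seq1[x - 1]!r} -> {seq2[y - 1]!r} at index {x - 1}")
            x, y = x - 1, y - 1
        elif x > 0 and matrix[x][y] == matrix[x - 1][y] + 1:
            ops.append(f"delete {seq1[x - 1]!r} at index {x - 1}")
            x -= 1
        else:
            ops.append(f"insert {seq2[y - 1]!r} after the first {x} characters")
            y -= 1
    return ops[::-1]  # edits were collected back to front


print(edit_script("stamp", "stomp"))
# ["substitute 'a' -> 'o' at index 2"]
```

The number of edits returned always equals the distance in the bottom-right cell, which gives a quick sanity check against `levenshtein`.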