# Edit distance 

Edit distance measures how similar or dissimilar two text samples are. It quantifies the difference as a number: the minimum total cost of the operations required to transform the source text (text 1) into the target text (text 2). The operations can be applied at either the word level or the character level. Below are three implementations of edit distance, which differ in the operations they allow and the cost of each operation.

The implementation uses no packages other than Python's standard `random` and `string` modules. A fixed random seed is set for reproducibility.

In [8]:
import random
import string
from utils import (simple_edit_distance,
            levenshtein_edit_distance, damerau_levenshtein_distance, randomize_text)
random.seed(13)

The methods can be called on any two text samples. As an example, we use a toxic social media post and its detoxified version.

In [9]:
# Sample text (as random tweets)

post1 = "My page should be protected first so that worthless scum like you can't keep vandalizing it."
post2 = "My page should be protected first so that unpleasant people like you can't keep vandalizing it."


## 1. Simple Edit Distance
It allows insertion, deletion and substitution operations, each with cost 1. The implementation operates at both word and character levels.

In [10]:
# 1. Simple edit distance at character / word level (default is character)
sed_c = simple_edit_distance(post1, post2, level='c')
sed_w = simple_edit_distance(post1, post2, level='w')
print("Simple edit distance (at word level): ", sed_w)
print("Simple edit distance (at char level): ", sed_c)

Simple edit distance (at word level):  2
Simple edit distance (at char level):  15
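
The `utils` implementation is not shown in this notebook. To illustrate the unit-cost scheme, here is a minimal dynamic-programming sketch. The function name `sketch_simple_edit_distance` and the whitespace tokenization at word level are assumptions made for illustration, not the module's actual code.

```python
def sketch_simple_edit_distance(source, target, level="c"):
    """Unit-cost edit distance: insert, delete and substitute all cost 1."""
    # Assumption: word level means splitting on whitespace.
    a = source.split() if level == "w" else list(source)
    b = target.split() if level == "w" else list(target)

    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, tok_b in enumerate(b, start=1):
            match_or_sub = prev[j - 1] + (0 if tok_a == tok_b else 1)
            curr.append(min(prev[j] + 1,      # delete tok_a
                            curr[j - 1] + 1,  # insert tok_b
                            match_or_sub))    # keep or substitute
        prev = curr
    return prev[-1]


print(sketch_simple_edit_distance("kitten", "sitting"))  # 3
```

Keeping only two rows of the table is enough here because each cell depends only on the current and the previous row.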


## 2. Levenshtein Edit Distance
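Insertion and deletion each cost 1, while substitution costs 2. This is equivalent to disallowing substitution, because replacing a token costs the same as deleting it and then inserting the new one. Note that many references use the name Levenshtein distance for the unit-cost variant of the previous section; the naming here follows the `utils` functions. The implementation operates at both word and character levels.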

In [11]:
# 2. Levenshtein edit distance at character / word level (default is character)
led_c = levenshtein_edit_distance(post1, post2, level = 'c')
led_w = levenshtein_edit_distance(post1, post2, level = 'w')
print("Levenshtein edit distance (at word level): ", led_w)
print("Levenshtein edit distance (at char level): ", led_c)

Levenshtein edit distance (at word level):  4
Levenshtein edit distance (at char level):  23
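
As a rough illustration of how the substitution cost changes the result, the sketch below parameterizes it. The name `sketch_weighted_edit_distance`, the `sub_cost` parameter and the whitespace tokenization are assumptions for this example and may not match the `utils` code.

```python
def sketch_weighted_edit_distance(source, target, level="c", sub_cost=2):
    """Edit distance with insert/delete cost 1 and a configurable substitution cost."""
    # Assumption: word level means splitting on whitespace.
    a = source.split() if level == "w" else list(source)
    b = target.split() if level == "w" else list(target)

    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            match_or_sub = prev[j - 1] + (0 if tok_a == tok_b else sub_cost)
            curr.append(min(prev[j] + 1,      # delete tok_a
                            curr[j - 1] + 1,  # insert tok_b
                            match_or_sub))    # keep or substitute
        prev = curr
    return prev[-1]


# Two substitutions (k->s, e->i) at cost 2 each, plus one insertion (g).
print(sketch_weighted_edit_distance("kitten", "sitting"))  # 5
```

With `sub_cost=2`, each differing token adds 2 instead of 1, which is why the word-level distance for the two posts doubles from 2 to 4.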


## 3. Damerau Levenshtein Edit Distance
It extends simple edit distance with a transposition operation, which swaps two adjacent tokens. All four operations (insertion, deletion, substitution and transposition) have cost 1. The implementation operates at both word and character levels.

In [12]:
# 3. Damerau-Levenshtein edit distance at character / word level (default is character)
dled_c = damerau_levenshtein_distance(post1, post2, level = 'c')
dled_w = damerau_levenshtein_distance(post1, post2, level = 'w')
print("Damerau levenshtein edit distance (at word level): ", dled_w)
print("Damerau levenshtein edit distance (at char level): ", dled_c)


Damerau levenshtein edit distance (at word level):  2
Damerau levenshtein edit distance (at char level):  15
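
The sketch below shows one common way to add transposition: the restricted variant, often called optimal string alignment, in which a swapped pair is not edited again. Whether `utils` implements this restricted form or the unrestricted one is not stated here, so treat the function `sketch_osa_distance` and its tokenization as assumptions.

```python
def sketch_osa_distance(source, target, level="c"):
    """Restricted Damerau-Levenshtein (optimal string alignment), all costs 1."""
    # Assumption: word level means splitting on whitespace.
    a = source.split() if level == "w" else list(source)
    b = target.split() if level == "w" else list(target)

    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every token of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every token of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Transposition: the last two tokens appear swapped in the target.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(sketch_osa_distance("form", "from"))  # 1
```

For the two posts above, the Damerau-Levenshtein results equal the simple edit distance, since no adjacent swap shortens the edit path between them.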


## 4. (Bonus) Text Distortion
It applies randomly selected operations (insertion, deletion, substitution and transposition) for a given number of revolutions (spins) to distort the text. This can be helpful for obscuring personal information, e.g., names and affiliations in social media posts.

The following example shows the original text and the result after 10, 50 and 100 spins of distortion.

In [7]:
# 4. Randomize text
randomized_10 = randomize_text(post1, 10) # default revs = 1
randomized_50 = randomize_text(post1, 50) # default revs = 1
randomized_100 = randomize_text(post1, 100) # default revs = 1
print("Original post: ", post1)
print("Randomize text (10 revs): ", randomized_10) 
print("Randomize text (50 revs):", randomized_50) 
print("Randomize text (100 revs): ", randomized_100) 

Original post:  My page should be protected first so that worthless scum like you can't keep vandalizing it.
Randomize text (10 revs):  ynage sould Pbe protected first so thHtworthless scumlike you cant keep vandalziing it.
Randomize text (50 revs): My paTgs hJdlgepaodrctgd firgtjC s thwato rNvhlesss czml ikCe youcH Wan't keeT ndaliing t.
Randomize text (100 revs):  ytyMZg TtsAuuoWdb ckroteVdfPsrt  othkCw ohtlesss crmkHabuG NFcants'UkepandINialviIig yt.
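
The distortion step can be sketched as a loop that applies one random character-level edit per spin. This is only an illustration under stated assumptions: the function `sketch_randomize_text`, its `rng` parameter, the uniform choice among the four operations and the use of `string.ascii_letters` as the alphabet are all hypothetical and may differ from `randomize_text` in `utils`.

```python
import random
import string


def sketch_randomize_text(text, revs=1, rng=None):
    """Apply `revs` random character-level edits to `text`."""
    if rng is None:
        rng = random.Random()
    chars = list(text)
    for _ in range(revs):
        op = rng.choice(["insert", "delete", "substitute", "transpose"])
        if op == "insert":
            # Any position from the start to just past the end is valid.
            pos = rng.randint(0, len(chars))
            chars.insert(pos, rng.choice(string.ascii_letters))
        elif op == "delete" and chars:
            del chars[rng.randrange(len(chars))]
        elif op == "substitute" and chars:
            chars[rng.randrange(len(chars))] = rng.choice(string.ascii_letters)
        elif op == "transpose" and len(chars) > 1:
            # Swap a character with its right-hand neighbour.
            pos = rng.randrange(len(chars) - 1)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
        # An operation that cannot be applied (e.g. delete on empty text) is skipped.
    return "".join(chars)


post = "My page should be protected first."
print(sketch_randomize_text(post, 10, rng=random.Random(13)))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible without touching the global seed. Each spin changes the length by at most one character, so the output stays within `revs` characters of the original length.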
