# Edit distance 

Edit distance measures how similar or dissimilar two text samples are. It quantifies the difference as a numeric value: the minimum total cost of the operations required to transform a source text (text 1) into a target text (text 2). It can be computed at either the word or the character level. The following three implementations differ in the types of operations they allow and in the costs of those operations.

The implementation relies only on Python's built-in `random` and `string` modules. The random seed is fixed for reproducibility.

In [4]:
import random
import string
from utils import (
    simple_edit_distance,
    levenshtein_edit_distance,
    damerau_levenshtein_distance,
    randomize_text,
)
random.seed(13)

The methods can be called on any two text samples. As an example, we use two social media posts on a similar topic.

In [5]:
# Sample text (as random tweets)

post1 = "Excited to share our latest research on AI and its impact on social sciences! Leveraging data for better insights"
post2 = "Thrilled about our new findings on how AI transforms social science research. Innovation meets impact!"


## 1. Simple Edit Distance
It allows insertion, deletion and substitution operations, each with cost 1. The implementation operates at both the word and character levels.

In [6]:
# 1. Simple edit distance at character / word level (default is character)
sed_c = simple_edit_distance(post1, post2, level='c')
sed_w = simple_edit_distance(post1, post2, level='w')
print("Simple edit distance (at word level): ", sed_w)
print("Simple edit distance (at char level): ", sed_c)

Simple edit distance (at word level):  16
Simple edit distance (at char level):  71
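
As a rough illustration of the unit-cost scheme described in this section, the following minimal sketch computes the distance with dynamic programming. It is not the `utils.simple_edit_distance` implementation, which is not shown here. The name `edit_distance_sketch` and the whitespace tokenization used for `level="w"` are assumptions made for this example.

```python
def edit_distance_sketch(source, target, level="c"):
    """Hypothetical sketch: unit-cost edit distance (insert, delete, substitute)."""
    # Assumption: word level means splitting on whitespace.
    a = source.split() if level == "w" else list(source)
    b = target.split() if level == "w" else list(target)

    # prev[j] is the distance between the tokens of `a` processed so far
    # (initially none) and the first j tokens of `b`.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # turning i tokens into an empty sequence takes i deletions
        for j, tok_b in enumerate(b, start=1):
            sub = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + sub))   # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance_sketch("kitten", "sitting"))                    # → 3
print(edit_distance_sketch("the cat sat", "the dog sat", level="w"))  # → 1
```

Keeping only two rows of the table at a time is a design choice that reduces memory to the length of the target sequence; a full table is only needed if the edit path itself must be recovered.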


## 2. Levenshtein Edit Distance
Insertion and deletion have cost 1, while substitution has cost 2. This is equivalent to disallowing substitution, because a substitution then costs the same as a deletion followed by an insertion. The implementation operates at both the word and character levels.

In [7]:
# 2. Levenshtein edit distance at character / word level (default is character)
led_c = levenshtein_edit_distance(post1, post2, level = 'c')
led_w = levenshtein_edit_distance(post1, post2, level = 'w')
print("Levenshtein edit distance (at word level): ", led_w)
print("Levenshtein edit distance (at char level): ", led_c)

Levenshtein edit distance (at word level):  26
Levenshtein edit distance (at char level):  109
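
The equivalence claimed in this section (substitution at cost 2 versus no substitution at all) can be checked with a small sketch. Both functions below are hypothetical illustrations, not the `utils.levenshtein_edit_distance` implementation: `weighted_distance` charges 2 for a substitution, while `indel_distance` permits only insertions and deletions.

```python
def weighted_distance(a, b, sub_cost=2):
    """Hypothetical sketch: insert/delete cost 1, substitution cost `sub_cost`."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Hypothetical sketch: only insertions and deletions are allowed."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i]
        for j in range(1, len(b) + 1):
            best = min(prev[j] + 1, curr[j - 1] + 1)
            if a[i - 1] == b[j - 1]:
                best = min(best, prev[j - 1])  # free diagonal move on a match
            curr.append(best)
        prev = curr
    return prev[-1]


print(weighted_distance("intention", "execution"))  # → 8
print(indel_distance("intention", "execution"))     # → 8
```

Both functions accept any sequences, so passing lists of words (for example `text.split()`) gives a word-level comparison under the same costs.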


## 3. Damerau Levenshtein Edit Distance
It extends simple edit distance with a transposition operation. All four operations (insertion, deletion, substitution and transposition) have cost 1. The implementation operates at both the word and character levels.

In [8]:
# 3. Damerau-Levenshtein edit distance at character / word level (default is character)
dled_c = damerau_levenshtein_distance(post1, post2, level = 'c')
dled_w = damerau_levenshtein_distance(post1, post2, level = 'w')
print("Damerau levenshtein edit distance (at word level): ", dled_w)
print("Damerau levenshtein edit distance (at char level): ", dled_c)


Damerau levenshtein edit distance (at word level):  16
Damerau levenshtein edit distance (at char level):  71
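
To make the transposition operation concrete, here is a minimal sketch of the optimal string alignment (OSA) variant, a common restricted form of Damerau-Levenshtein distance that swaps adjacent elements. Whether `utils.damerau_levenshtein_distance` uses this variant or the unrestricted one is not stated in this notebook, so treat `osa_distance` as a hypothetical illustration only.

```python
def osa_distance(a, b):
    """Hypothetical sketch: unit-cost distance with adjacent transpositions (OSA)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i elements
    for j in range(cols):
        d[0][j] = j  # insert all j elements

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Transposition: the last two elements of each prefix are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("abcd", "acbd"))  # → 1 (one swap of "bc")
print(osa_distance("ca", "abc"))     # → 3 (unrestricted Damerau-Levenshtein gives 2)
```

When the two inputs contain no swapped adjacent elements, the transposition branch never improves the result and the value equals the simple edit distance, which is consistent with the matching outputs in sections 1 and 3.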


## 4. Util Function
This section provides a utility function that calculates edit distances between text pairs from a CSV file. The function handles files with either one or two columns:
- If two columns are provided, each row is treated as a text pair.
- If one column is provided, all possible pairs are compared.

You can specify which distance metric to use (simple, Levenshtein, or Damerau-Levenshtein) and the level (character or word). This makes it easy to batch process and analyze text similarity in your own datasets.
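
The two input layouts can be illustrated with a small, self-contained sketch. It reads CSV data from in-memory strings instead of a file, and the helper name `pairs_from_rows` is hypothetical; it only demonstrates how rows could be turned into text pairs, not how the utility function itself is written.

```python
import csv
import io
from itertools import combinations


def pairs_from_rows(rows):
    """Hypothetical helper: build text pairs from parsed CSV rows."""
    rows = [row for row in rows if row]  # ignore empty lines
    if not rows:
        return []
    if len(rows[0]) == 1:
        # One column: compare every text with every other text once.
        texts = [row[0] for row in rows]
        return list(combinations(texts, 2))
    # Two columns: each row already is a (text1, text2) pair.
    return [(row[0], row[1]) for row in rows]


two_columns = io.StringIO("The sun rises in the east,The sun sets in the west\n"
                          "He likes to play football,He loves to play soccer\n")
one_column = io.StringIO("first text\nsecond text\nthird text\n")

print(pairs_from_rows(csv.reader(two_columns)))  # 2 pairs, one per row
print(pairs_from_rows(csv.reader(one_column)))   # 3 pairs: all combinations
```

Note that the one-column layout grows quadratically: n texts produce n(n-1)/2 pairs, which matters for larger files.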

In [12]:
import csv

def _pair_texts(pairs, method, level):
    results = []
    for t1, t2 in pairs:
        res = {}
        if method in ('simple', 'all'):
            res['simple'] = simple_edit_distance(t1, t2, level=level)
        if method in ('levenshtein', 'all'):
            res['levenshtein'] = levenshtein_edit_distance(t1, t2, level=level)
        if method in ('damerau', 'all'):
            res['damerau'] = damerau_levenshtein_distance(t1, t2, level=level)
        results.append((t1, t2, res))
    return results


def batch_edit_distance(csv_path=None, texts=None, method='all', level='w'):
    """
    Calculate edit distances for text pairs from a CSV file or a list of texts.
    
    Args:
        csv_path (str, optional): Path to the CSV file. If not provided, 'texts' must be given.
        texts (list, optional): List of texts to compare. Used if csv_path is not provided.
        method (str): 'simple', 'levenshtein', 'damerau', or 'all' (default).
        level (str): 'c' for character, 'w' for word (default 'w').
        
    Returns:
        results (list): List of tuples (text1, text2, {distance results})
    """
    if csv_path is not None:
        # Load rows from the CSV file, ignoring empty lines
        try:
            with open(csv_path, newline='', encoding='utf-8') as f:
                rows = [row for row in csv.reader(f) if row]
        except FileNotFoundError as err:
            raise FileNotFoundError(f"CSV file not found: {csv_path}") from err
        if not rows:
            raise ValueError(f"CSV file is empty: {csv_path}")
        if len(rows[0]) == 1:
            # Single column: compare all possible pairs
            texts = [row[0] for row in rows]
            pairs = [(texts[i], texts[j]) for i in range(len(texts)) for j in range(i+1, len(texts))]
        else:
            # Two columns: each row is a text pair
            if any(len(row) < 2 for row in rows):
                raise ValueError(f"Invalid CSV format (expected two columns per row): {csv_path}")
            pairs = [(row[0], row[1]) for row in rows]
    elif texts is not None:
        # No CSV given: compare all possible pairs from the list of texts
        pairs = [(texts[i], texts[j]) for i in range(len(texts)) for j in range(i+1, len(texts))]
    else:
        raise ValueError("Either csv_path or a list of texts must be provided")

    return _pair_texts(pairs, method, level)

# Example usage:
results = batch_edit_distance(csv_path='text_pair.csv', method='all', level='c')
for t1, t2, res in results:
    print(f"Text1: {t1}\nText2: {t2}\nDistances: {res}\n")

Text1: The sun rises in the east
Text2: The sun sets in the west
Distances: {'simple': 5, 'levenshtein': 5, 'damerau': 5}

Text1: He likes to play football
Text2: He loves to play soccer
Distances: {'simple': 9, 'levenshtein': 16, 'damerau': 9}

Text1: She made a cup of tea
Text2: She prepared a cup of coffee
Distances: {'simple': 11, 'levenshtein': 15, 'damerau': 11}

Text1: They visited the museum
Text2: They toured the art gallery
Distances: {'simple': 15, 'levenshtein': 22, 'damerau': 15}

Text1: Reading helps improve vocabulary
Text2: Reading enhances language skills
Distances: {'simple': 22, 'levenshtein': 34, 'damerau': 22}



## 5. (Bonus) Text Distortion
It applies randomly selected operations (insertion, deletion, substitution and transposition) for a given number of revolutions (spins) to randomize or distort the text. The method is particularly helpful for obscuring personal information, e.g., in social media posts that contain people's names and affiliations.

The following example shows the original text after 10, 50 and 100 spins of distortion.

In [7]:
# 5. Randomize text
randomized_10 = randomize_text(post1, 10) # default revs = 1
randomized_50 = randomize_text(post1, 50) # default revs = 1
randomized_100 = randomize_text(post1, 100) # default revs = 1
print("Original post: ", post1)
print("Randomize text (10 revs): ", randomized_10) 
print("Randomize text (50 revs):", randomized_50) 
print("Randomize text (100 revs): ", randomized_100) 

Original post:  My page should be protected first so that worthless scum like you can't keep vandalizing it.
Randomize text (10 revs):  ynage sould Pbe protected first so thHtworthless scumlike you cant keep vandalziing it.
Randomize text (50 revs): My paTgs hJdlgepaodrctgd firgtjC s thwato rNvhlesss czml ikCe youcH Wan't keeT ndaliing t.
Randomize text (100 revs):  ytyMZg TtsAuuoWdb ckroteVdfPsrt  othkCw ohtlesss crmkHabuG NFcants'UkepandINialviIig yt.
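
The distortion idea described in this section can be sketched in a few lines. This is a hypothetical character-level illustration, not the `utils.randomize_text` implementation: the function name `distort_text_sketch`, the `seed` parameter and the choice of ASCII letters as replacement characters are all assumptions.

```python
import random
import string


def distort_text_sketch(text, revs=1, seed=None):
    """Hypothetical sketch: apply `revs` random edit operations to `text`."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    chars = list(text)
    for _ in range(revs):
        op = rng.choice(("insert", "delete", "substitute", "transpose"))
        if op == "insert" or not chars:
            # Also used as a fallback when the text has become empty.
            pos = rng.randint(0, len(chars))
            chars.insert(pos, rng.choice(string.ascii_letters))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        elif op == "substitute":
            chars[rng.randrange(len(chars))] = rng.choice(string.ascii_letters)
        elif len(chars) > 1:
            # Transpose two adjacent characters.
            pos = rng.randrange(len(chars) - 1)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


sample = "Leveraging data for better insights"
for revs in (10, 50, 100):
    print(f"{revs} revs:", distort_text_sketch(sample, revs, seed=13))
```

Because each spin is a single edit operation, the edit distance between the original and the distorted text can never exceed the number of spins, which gives a rough upper bound on how far the text has drifted.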
