# Language metrics

## Introduction

Language metrics quantify how similar or different two pieces of text are, which makes them a basic tool in Natural Language Processing (NLP). One such metric is the Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
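
The definition above can be written as a recurrence over prefixes of the two strings. The notation is an assumption of this sketch: $a$ and $b$ are the two strings, $\operatorname{lev}(i, j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and $[a_i \neq b_j]$ is 1 when the characters differ and 0 otherwise.

```latex
\operatorname{lev}(i, j) =
\begin{cases}
  \max(i, j) & \text{if } \min(i, j) = 0, \\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}(i - 1, j) + 1 & \text{(deletion)} \\
    \operatorname{lev}(i, j - 1) + 1 & \text{(insertion)} \\
    \operatorname{lev}(i - 1, j - 1) + [a_i \neq b_j] & \text{(substitution or match)}
  \end{cases}
  & \text{otherwise.}
\end{cases}
```

For example, turning "kitten" into "sitting" takes three edits: substitute k with s, substitute e with i, and insert g at the end.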

## Levenshtein distance

In this notebook, the Levenshtein distance is computed over tokens (words, subwords, or other meaningful units) rather than individual characters.
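
To make the difference between character-level and token-level distance concrete, here is a minimal, library-free sketch. The function name `sequence_distance` is hypothetical and is not part of any package used in this notebook; it simply applies the usual dynamic-programming recurrence to any pair of sequences.

```python
def sequence_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Character level: each character is one unit
print(sequence_distance("kitten", "sitting"))  # 3

# Token level: each whitespace-separated word is one unit
print(sequence_distance("kitten sitting".split(), "sitting kitten".split()))  # 2
```

Because the function only relies on `len`, iteration, and `==`, the same code handles both cases; only the choice of unit changes.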

### Basic Example: Token-Based Levenshtein Distance

Let's begin with a simple token-based example, using whitespace as a basic tokenizer.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from Levenshtein import distance, editops

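# Split on whitespace so the comparison runs over tokens, not characters.
# Assumes a recent Levenshtein release that accepts any sequence of hashable items.
A = "kitten sitting".split()
B = "sitting kitten".split()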


In [None]:
# Calculate the Levenshtein distance
dist = distance(A, B)
print(f"Distance: {dist}")

# Get the edit operations as (operation, source index, destination index) tuples
edits = editops(A, B)

In [None]:
# Count the operations
substitutions = sum(1 for op in edits if op[0] == 'replace')
deletions = sum(1 for op in edits if op[0] == 'delete')
insertions = sum(1 for op in edits if op[0] == 'insert')
# Elements of A that were neither deleted nor substituted
correct = len(A) - deletions - substitutions

print(f"Substitutions: {substitutions}")
print(f"Deletions: {deletions}")
print(f"Insertions: {insertions}")
print(f"Correct: {correct}")


In [None]:
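# Function to create the distance matrix
def levenshtein_matrix(s1, s2):
    """Return the (len(s1) + 1) x (len(s2) + 1) matrix of distances between prefixes."""
    m, n = len(s1), len(s2)
    d = np.zeros((m + 1, n + 1), dtype=int)
    # Converting a prefix to or from the empty sequence costs one edit per element
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                # Elements match: no additional cost
                d[i, j] = d[i - 1, j - 1]
            else:
                # Cheapest of deletion, insertion, and substitution
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + 1)
    # The bottom-right entry d[m, n] is the distance between s1 and s2
    return d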
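
The distance matrix also encodes which edits were made: walking back from the bottom-right cell to the top-left cell recovers one optimal sequence of operations. The sketch below is a plain-Python illustration of that idea. The name `backtrace_editops` is hypothetical, and when several optimal edit sequences exist its tie-breaking (diagonal moves first) may differ from what a library returns.

```python
def backtrace_editops(src, tgt):
    """Recover one optimal list of (operation, source index, target index) edits."""
    m, n = len(src), len(tgt)
    # Build the same prefix-distance matrix as above, using nested lists
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk from the bottom-right corner back to the origin
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1]:
            i, j = i - 1, j - 1                      # match: no edit recorded
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    ops.reverse()
    return ops


print(backtrace_editops("the cat sat".split(), "the cat sat down".split()))
# [('insert', 3, 3)]
```

The number of recovered operations always equals the distance stored in the bottom-right cell, since every non-matching step on the path costs exactly one edit.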

In [None]:
# Build and plot the distance matrix
matrix = levenshtein_matrix(A, B)
plt.imshow(matrix, cmap='viridis', origin='upper', interpolation='nearest')
plt.colorbar()
plt.xlabel('String B')
plt.ylabel('String A')
plt.title('Levenshtein Distance Matrix')
plt.show()

### Complex Example: Custom Tokenizer

For more complex examples, especially those relevant to large language models (LLMs), we may need a tokenizer that goes beyond simple whitespace separation.
Let's define a custom tokenizer and use it for the Levenshtein distance comparison.

In [None]:
import re

def custom_tokenizer(text):
    # Split into runs of word characters and single punctuation marks; whitespace is dropped
    return re.findall(r'\b\w+\b|\S', text)

# Examples
A = "Renewable energy sources are essential for sustainable development."
B = "Sustainable development necessitates the use of renewable energy sources."

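# Compute the token-based Levenshtein distance.
# NOTE: `levenshtein` is not defined or imported in this notebook. It is assumed to
# return a result object exposing `distance`, `counters`, `matrix`, and `plot()`,
# as used in the cells below.
d = levenshtein(A, B, tokenizer=custom_tokenizer)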

In [None]:
print(custom_tokenizer(A))
print(custom_tokenizer(B))
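
To see what the custom tokenizer changes compared with whitespace splitting, the short sketch below tokenizes the first sentence both ways. It repeats the tokenizer definition so that it runs on its own; the variable name `sentence` is only for illustration.

```python
import re


def custom_tokenizer(text):
    # Runs of word characters, or any single non-whitespace character
    return re.findall(r'\b\w+\b|\S', text)


sentence = "Renewable energy sources are essential for sustainable development."

# Whitespace splitting leaves the final period attached to the last word
print(sentence.split()[-1])             # development.

# The custom tokenizer emits the period as a token of its own
print(custom_tokenizer(sentence)[-2:])  # ['development', '.']
```

Note that Python compares tokens case-sensitively, so "Renewable" and "renewable" are different tokens unless the text is lowercased first. Whether the `levenshtein` function used here normalizes case is not stated in this notebook.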

In [None]:
d.distance

In [None]:
print(f"Substitutions: {d.counters.substitutions}")
print(f"Deletions: {d.counters.deletions}")
print(f"Insertions: {d.counters.insertions}")
print(f"Correct: {d.counters.correct}")

In [None]:
d.matrix

In [None]:
d.plot()
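
Since the `levenshtein` function used above is not defined in this notebook, the following is a hypothetical stand-in that shows how a token-based distance with operation counters could be computed. The names `levenshtein_sketch`, `Result`, and `Counters` are assumptions made for this illustration, not the actual API, and the sketch does not reproduce the `matrix` attribute or the `plot()` method.

```python
import re
from dataclasses import dataclass


@dataclass
class Counters:
    substitutions: int = 0
    deletions: int = 0
    insertions: int = 0
    correct: int = 0


@dataclass
class Result:
    distance: int
    counters: Counters


def levenshtein_sketch(a, b, tokenizer=str.split):
    """Token-level Levenshtein distance with per-operation counts (hypothetical helper)."""
    src, tgt = tokenizer(a), tokenizer(b)
    # Each cell is (cost, substitutions, deletions, insertions, correct)
    prev = [(j, 0, 0, j, 0) for j in range(len(tgt) + 1)]
    for i, s in enumerate(src, start=1):
        curr = [(i, 0, i, 0, 0)]
        for j, t in enumerate(tgt, start=1):
            c, su, de, ins, ok = prev[j - 1]
            if s == t:
                diag = (c, su, de, ins, ok + 1)      # match
            else:
                diag = (c + 1, su + 1, de, ins, ok)  # substitution
            c, su, de, ins, ok = prev[j]
            delete = (c + 1, su, de + 1, ins, ok)
            c, su, de, ins, ok = curr[j - 1]
            insert = (c + 1, su, de, ins + 1, ok)
            # Keep the cheapest option; ties favor the diagonal move
            curr.append(min(diag, delete, insert, key=lambda cell: cell[0]))
        prev = curr
    cost, su, de, ins, ok = prev[-1]
    return Result(cost, Counters(su, de, ins, ok))


def punctuation_tokenizer(text):
    # Same pattern as the custom tokenizer defined earlier
    return re.findall(r'\b\w+\b|\S', text)


result = levenshtein_sketch("the cat sat.", "the cat sat down.",
                            tokenizer=punctuation_tokenizer)
print(result.distance)  # 1
print(result.counters)
```

Carrying the counters alongside the cost in each cell avoids a separate backtrace step, at the price of reporting only one of possibly several optimal breakdowns.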