# Distances Module API Reference

The `distfeat.distances` module calculates phonetic distances between phonemes. It covers:

- Calculating distances between individual phonemes
- Building distance matrices for phoneme sets
- Multiple distance metrics (Hamming, Jaccard, Euclidean, Cosine, Manhattan, K-means)
- Custom distance method registration
- Performance optimization with caching
- Matrix-based operations for large-scale analysis

## Quick Start

```python
from distfeat import calculate_distance, build_distance_matrix, available_distance_methods

# Calculate distance between two phonemes
distance = calculate_distance('p', 'b', method='hamming')
print(f"Distance p-b: {distance:.3f}")

# Build distance matrix for a set of phonemes
matrix, phonemes = build_distance_matrix(['p', 'b', 't', 'd'], method='euclidean')
print(f"Matrix shape: {matrix.shape}")

# List available methods
methods = available_distance_methods()
print(f"Available methods: {methods}")
```

## Core Functions

### calculate_distance()

Calculate the phonetic distance between two phonemes using various metrics.

**Signature:**
```python
calculate_distance(
    phoneme1: str,
    phoneme2: str,
    method: str = 'hamming',
    normalize: bool = True,
    on_error: str = 'warn',
    **kwargs
) -> Optional[float]
```

**Parameters:**
- `phoneme1` (str): First IPA phoneme symbol
- `phoneme2` (str): Second IPA phoneme symbol
- `method` (str): Distance calculation method - 'hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan', 'kmeans'
- `normalize` (bool): Normalize distance to [0, 1] range (default: True)
- `on_error` (str): Error handling mode - 'raise', 'warn', or 'ignore'
- `**kwargs`: Additional arguments for specific methods (e.g., n_clusters for k-means)

**Returns:**
- `float` or `None`: Distance value in [0, 1] range (if normalized), or None if phonemes not found
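
How `on_error` relates to the `None` return value can be pictured with a small stand-in. The sketch below is not distfeat's implementation: `TOY_INVENTORY` and `toy_lookup` are hypothetical names, and the exception type raised in `'raise'` mode is an assumption.

```python
import warnings
from typing import Optional

# Hypothetical three-phoneme inventory standing in for the real feature table
TOY_INVENTORY = {"p", "b", "t"}

def toy_lookup(phoneme: str, on_error: str = "warn") -> Optional[str]:
    """Return the phoneme if known; otherwise react according to on_error."""
    if phoneme in TOY_INVENTORY:
        return phoneme
    message = f"Unknown phoneme: {phoneme!r}"
    if on_error == "raise":
        raise ValueError(message)  # exception type is an assumption
    if on_error == "warn":
        warnings.warn(message)
        return None
    if on_error == "ignore":
        return None
    raise ValueError(f"Invalid on_error mode: {on_error!r}")

# Exercise the two non-raising modes and record any warnings they emit
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_result = toy_lookup("xyz", on_error="warn")
    ignore_result = toy_lookup("xyz", on_error="ignore")

# Exercise the raising mode
try:
    toy_lookup("xyz", on_error="raise")
    raised = False
except ValueError:
    raised = True

print(warn_result, ignore_result, len(caught), raised)
```

In this sketch only `'warn'` emits a warning, both `'warn'` and `'ignore'` return `None`, and `'raise'` never returns.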

**Caching:**
The function is wrapped in `@lru_cache(maxsize=4096)`, so repeated calls with the same arguments are served from the cache.
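
To illustrate what an `lru_cache` wrapper implies for callers, the sketch below decorates a stand-in function. `toy_distance` is hypothetical and unrelated to distfeat's metrics; only the `functools` behaviour is the point.

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in: fraction of differing characters."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    diffs = sum(x != y for x, y in zip(a.ljust(length), b.ljust(length)))
    return diffs / length

toy_distance("p", "b")  # computed: cache miss
toy_distance("p", "b")  # same arguments: cache hit
toy_distance("b", "p")  # swapped arguments form a different key: cache miss

info = toy_distance.cache_info()
print(info.hits, info.misses, info.currsize)
```

Because `lru_cache` keys on the exact arguments, a symmetric pair queried in both orders occupies two cache entries.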

**Distance Methods:**
- **Hamming**: Count of differing binary features (divided by the number of features when normalized)
- **Jaccard**: 1 - (intersection / union) of positive features
- **Euclidean**: L2 norm of feature vector difference
- **Cosine**: 1 - cosine similarity of feature vectors
- **Manhattan**: L1 norm (sum of absolute differences)
- **K-means**: Distance based on cluster centroids

In [None]:
# Examples of calculate_distance()
# Assumes distfeat is importable (installed in the active environment)
from distfeat import calculate_distance

# Compare all distance methods for p vs b (voicing difference)
phoneme_pair = ('p', 'b')
methods = ['hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan']

print(f"Distance between /{phoneme_pair[0]}/ and /{phoneme_pair[1]}/ (voicing difference):")
print("-" * 60)

distances = {}
for method in methods:
    dist = calculate_distance(phoneme_pair[0], phoneme_pair[1], method=method)
    distances[method] = dist
    print(f"{method:<12}: {dist:.4f}")

print("\n" + "="*60 + "\n")

# Compare different phoneme pairs with Hamming distance
pairs = [('p', 'b'), ('p', 't'), ('p', 'k'), ('a', 'i'), ('s', 'ʃ')]
print("Hamming distances for various phoneme pairs:")
print("-" * 45)

for p1, p2 in pairs:
    dist = calculate_distance(p1, p2, method='hamming')
    if dist is not None:
        print(f"/{p1}/ - /{p2}/: {dist:.3f}")
    else:
        print(f"/{p1}/ - /{p2}/: phoneme not found")

In [None]:
# Advanced distance method examples

# K-means clustering distance with different cluster numbers
print("K-means distance with different cluster numbers:")
print("-" * 50)

for n_clusters in [5, 10, 15, 20]:
    dist = calculate_distance('p', 't', method='kmeans', n_clusters=n_clusters)
    if dist is not None:
        print(f"n_clusters={n_clusters:2d}: {dist:.4f}")

print("\n" + "="*60 + "\n")

# Normalization effects
phoneme_pair = ('p', 'a')  # Very different phonemes
methods = ['hamming', 'euclidean', 'manhattan']

print(f"Normalization effects for /{phoneme_pair[0]}/ vs /{phoneme_pair[1]}/:")
print("-" * 55)
print(f"{'Method':<12} {'Normalized':<12} {'Raw':<12}")

for method in methods:
    normalized = calculate_distance(phoneme_pair[0], phoneme_pair[1], 
                                  method=method, normalize=True)
    raw = calculate_distance(phoneme_pair[0], phoneme_pair[1], 
                           method=method, normalize=False)
    
    if normalized is not None and raw is not None:
        print(f"{method:<12} {normalized:<12.4f} {raw:<12.4f}")

### build_distance_matrix()

Build a symmetric distance matrix for a set of phonemes.

**Signature:**
```python
build_distance_matrix(
    phonemes: Optional[List[str]] = None,
    method: str = 'hamming',
    normalize: bool = True,
    n_clusters: Optional[int] = None,
    cache: bool = True
) -> Tuple[np.ndarray, List[str]]
```

**Parameters:**
- `phonemes` (Optional[List[str]]): List of phonemes to include (None for all phonemes in system)
- `method` (str): Distance method to use
- `normalize` (bool): Normalize distances to [0, 1] range
- `n_clusters` (Optional[int]): Number of clusters for k-means method
- `cache` (bool): Use cached distance calculations for better performance

**Returns:**
- `Tuple[np.ndarray, List[str]]`: (distance_matrix, phoneme_list)
  - `distance_matrix`: Square symmetric matrix of shape (n_phonemes, n_phonemes)
  - `phoneme_list`: Ordered list of phonemes corresponding to matrix rows/columns

**Properties:**
- Matrix is symmetric: `matrix[i,j] = matrix[j,i]`
- Diagonal is zero: `matrix[i,i] = 0`
- Missing phonemes get maximum distance (1.0 if normalized, inf otherwise)
- For large phoneme sets, caching provides significant speedup
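
The symmetry and zero-diagonal properties follow from filling only the upper triangle and mirroring it. A minimal sketch, assuming toy 0/1 feature vectors and a normalized Hamming distance (`toy_matrix` is a hypothetical helper, not the library's code):

```python
import numpy as np

def toy_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise normalized Hamming distances between the rows of a 0/1 array."""
    n = len(vectors)
    matrix = np.zeros((n, n), dtype=float)  # diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            d = float(np.mean(vectors[i] != vectors[j]))
            matrix[i, j] = d
            matrix[j, i] = d  # mirror to keep the matrix symmetric
    return matrix

# Three made-up feature vectors with four binary features each
vectors = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
m = toy_matrix(vectors)
print(m)
```

Only n(n-1)/2 distances are computed for n rows; the remaining entries are copies or zeros.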

In [None]:
# Examples of build_distance_matrix()
from distfeat import build_distance_matrix
import numpy as np

# Small matrix for consonant stops
stops = ['p', 'b', 't', 'd', 'k', 'g']
matrix, phoneme_list = build_distance_matrix(stops, method='hamming')

print("Distance matrix for stops (Hamming):")
print("-" * 40)
print(f"Shape: {matrix.shape}")
print(f"Phonemes: {phoneme_list}")

# Display matrix with phoneme labels
print("\nMatrix values:")
print(f"{'':>4}", end="")
for p in phoneme_list:
    print(f"{p:>6}", end="")
print()

for i, p1 in enumerate(phoneme_list):
    print(f"{p1:>4}", end="")
    for j, p2 in enumerate(phoneme_list):
        print(f"{matrix[i,j]:>6.3f}", end="")
    print()

print("\n" + "="*60 + "\n")

# Matrix properties verification
print("Matrix properties:")
print(f"  Symmetric: {np.allclose(matrix, matrix.T)}")
print(f"  Zero diagonal: {np.allclose(np.diag(matrix), 0)}")
print(f"  Min value: {np.min(matrix):.4f}")
print(f"  Max value: {np.max(matrix):.4f}")
print(f"  Mean distance: {np.mean(matrix[np.triu_indices_from(matrix, k=1)]):.4f}")

In [None]:
# Comparing different distance methods on same phoneme set
vowels = ['a', 'e', 'i', 'o', 'u']
methods = ['hamming', 'jaccard', 'euclidean', 'cosine']

print("Vowel distance matrices by method:")
print("=" * 50)

for method in methods:
    matrix, phonemes = build_distance_matrix(vowels, method=method)
    
    print(f"\n{method.upper()} distances:")
    print(f"{'':>4}", end="")
    for p in phonemes:
        print(f"{p:>6}", end="")
    print()
    
    for i, p1 in enumerate(phonemes):
        print(f"{p1:>4}", end="")
        for j, p2 in enumerate(phonemes):
            print(f"{matrix[i,j]:>6.3f}", end="")
        print()
    
    # Summary statistics
    upper_tri = matrix[np.triu_indices_from(matrix, k=1)]
    print(f"    Mean: {np.mean(upper_tri):.3f}, Std: {np.std(upper_tri):.3f}")

In [None]:
# Performance comparison: cached vs uncached
import time

test_phonemes = ['p', 'b', 't', 'd', 'k', 'g', 'm', 'n', 'ŋ', 'f', 'v', 's', 'z']

print("Performance comparison (cached vs uncached):")
print("-" * 50)

# Cached version (perf_counter is better suited to short intervals than time.time)
start = time.perf_counter()
matrix_cached, _ = build_distance_matrix(test_phonemes, method='hamming', cache=True)
cached_time = time.perf_counter() - start

# Uncached version
start = time.perf_counter()
matrix_uncached, _ = build_distance_matrix(test_phonemes, method='hamming', cache=False)
uncached_time = time.perf_counter() - start

print(f"Cached time:   {cached_time:.6f} seconds")
print(f"Uncached time: {uncached_time:.6f} seconds")
if cached_time > 0:  # guard against a zero reading on very fast runs
    print(f"Speedup:       {uncached_time / cached_time:.1f}x")
print(f"Results match: {np.allclose(matrix_cached, matrix_uncached)}")

print(f"\nMatrix size: {len(test_phonemes)} phonemes = {len(test_phonemes)**2} distances")
print(f"Unique distances: {len(test_phonemes) * (len(test_phonemes) - 1) // 2}")

### available_distance_methods()

Get a list of all available distance calculation methods.

**Signature:**
```python
available_distance_methods() -> List[str]
```

**Returns:**
- `List[str]`: List of method names (built-in + custom registered methods)

**Built-in Methods:**
- `'hamming'`: Binary feature difference count
- `'jaccard'`: Set-based similarity measure
- `'euclidean'`: L2 geometric distance
- `'cosine'`: Angular similarity measure
- `'manhattan'`: L1 city-block distance
- `'kmeans'`: Cluster-based distance

**Custom Methods:**
Methods registered via `register_distance_method()`

In [None]:
# Examples of available_distance_methods()
from distfeat import available_distance_methods

# List all available methods
methods = available_distance_methods()
print(f"Available distance methods ({len(methods)} total):")
print("-" * 40)

for i, method in enumerate(methods, 1):
    print(f"{i:2d}. {method}")

print("\n" + "="*60 + "\n")

# Method characteristics
method_info = {
    'hamming': 'Binary difference count (good for categorical features)',
    'jaccard': 'Set similarity (emphasizes shared positive features)',
    'euclidean': 'Geometric distance (sensitive to feature magnitude)',
    'cosine': 'Angular similarity (normalized by vector length)',
    'manhattan': 'City-block distance (robust to outliers)',
    'kmeans': 'Cluster-based distance (data-driven grouping)'
}

print("Method characteristics:")
print("-" * 70)

for method in methods:
    if method in method_info:
        print(f"{method:<10}: {method_info[method]}")
    else:
        print(f"{method:<10}: Custom registered method")

### register_distance_method()

Register a custom distance calculation method.

**Signature:**
```python
register_distance_method(name: str, func: Callable) -> None
```

**Parameters:**
- `name` (str): Name for the distance method (used in `calculate_distance()`)
- `func` (Callable): Function that takes two numpy arrays and returns a float distance

**Function Requirements:**
```python
def custom_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
    # vec1, vec2 are binary feature vectors (0s and 1s)
    # Return non-negative distance value
    pass
```

**Notes:**
- Custom functions receive normalized binary feature vectors
- Normalization (if requested) is handled automatically
- Function should return non-negative values
- Name must not conflict with built-in methods
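
Before registering a function, it can help to check it against the requirements above. The sketch below is one possible check, assuming 0/1 input vectors; `check_distance_function` is a hypothetical helper and not part of distfeat.

```python
import numpy as np

def check_distance_function(func, n_features: int = 12, trials: int = 50, seed: int = 0) -> bool:
    """Check non-negativity, symmetry and d(x, x) = 0 on binary test vectors."""
    rng = np.random.default_rng(seed)
    # Start with one deterministic extreme pair, then add random binary pairs
    pairs = [(np.zeros(n_features, dtype=int), np.ones(n_features, dtype=int))]
    for _ in range(trials):
        pairs.append((rng.integers(0, 2, size=n_features),
                      rng.integers(0, 2, size=n_features)))
    for a, b in pairs:
        d_ab = func(a, b)
        if d_ab < 0:
            return False  # negative distance
        if not np.isclose(d_ab, func(b, a)):
            return False  # not symmetric
        if not np.isclose(func(a, a), 0.0):
            return False  # identical vectors must be at distance zero
    return True

def overlap(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Fraction of non-matching features."""
    return float(np.mean(vec1 != vec2))

def broken(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Deliberately invalid: signed difference of positive-feature counts."""
    return float(np.sum(vec1) - np.sum(vec2))

print(check_distance_function(overlap), check_distance_function(broken))
```

The check is a sanity test on sampled vectors, not a proof that a function is a metric.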

In [None]:
# Examples of register_distance_method()
from distfeat import register_distance_method, calculate_distance, available_distance_methods
import numpy as np

# Define custom distance methods

def dice_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Dice coefficient distance (similar to Jaccard but different formula)."""
    intersection = np.sum((vec1 == 1) & (vec2 == 1))
    total_positive = np.sum(vec1) + np.sum(vec2)
    
    if total_positive == 0:
        return 0.0
    
    dice_coeff = (2 * intersection) / total_positive
    return 1.0 - dice_coeff

def weighted_hamming(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Weighted Hamming distance (early features matter more)."""
    differences = (vec1 != vec2).astype(float)
    # Give higher weights to earlier features
    weights = np.exp(-np.arange(len(vec1)) / 10)
    return np.sum(differences * weights) / np.sum(weights)

def overlap_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Simple overlap distance (fraction of non-matching features)."""
    if len(vec1) == 0:
        return 0.0
    return np.mean(vec1 != vec2)

# Register the custom methods
print("Registering custom distance methods:")
register_distance_method('dice', dice_distance)
register_distance_method('weighted_hamming', weighted_hamming)
register_distance_method('overlap', overlap_distance)

print("\n" + "="*60 + "\n")

# Test custom methods
test_pairs = [('p', 'b'), ('p', 't'), ('a', 'i')]
custom_methods = ['dice', 'weighted_hamming', 'overlap']

print("Testing custom distance methods:")
print("-" * 50)
print(f"{'Pair':<8} {'Dice':<8} {'W.Ham':<8} {'Overlap':<8}")

for p1, p2 in test_pairs:
    results = []
    for method in custom_methods:
        dist = calculate_distance(p1, p2, method=method)
        results.append(f"{dist:.3f}" if dist is not None else "N/A")
    
    print(f"{p1}-{p2:<5} {results[0]:<8} {results[1]:<8} {results[2]:<8}")

# Verify methods are available
print("\nUpdated available methods:")
updated_methods = available_distance_methods()
custom_count = len(updated_methods) - 6  # 6 built-in methods
print(f"Total methods: {len(updated_methods)} ({custom_count} custom)")
print(f"Custom methods: {[m for m in updated_methods if m not in ['hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan', 'kmeans']]}")

## Advanced Usage

### Phonetic Distance Analysis

Analyze phonetic relationships using distance matrices:

In [None]:
# Advanced: Phonetic distance analysis
import numpy as np
from distfeat import build_distance_matrix

def analyze_phonetic_groups(phonemes, group_labels, method='hamming'):
    """Analyze within-group vs between-group distances."""
    matrix, phoneme_list = build_distance_matrix(phonemes, method=method)
    
    # Create group mapping
    phoneme_to_group = dict(zip(phonemes, group_labels))
    
    within_group = []
    between_group = []
    
    for i, p1 in enumerate(phoneme_list):
        for j, p2 in enumerate(phoneme_list):
            if i < j:  # Upper triangle only
                distance = matrix[i, j]
                if phoneme_to_group[p1] == phoneme_to_group[p2]:
                    within_group.append(distance)
                else:
                    between_group.append(distance)
    
    return {
        'within_mean': np.mean(within_group),
        'within_std': np.std(within_group),
        'between_mean': np.mean(between_group),
        'between_std': np.std(between_group),
        'separation_ratio': np.mean(between_group) / np.mean(within_group),
        'within_distances': within_group,
        'between_distances': between_group
    }

# Test with stop consonants grouped by voicing
stops = ['p', 'b', 't', 'd', 'k', 'g']
voicing = ['voiceless', 'voiced', 'voiceless', 'voiced', 'voiceless', 'voiced']

analysis = analyze_phonetic_groups(stops, voicing, method='hamming')

print("Voicing-based grouping analysis (stops):")
print("=" * 45)
print(f"Within-group (same voicing):")
print(f"  Mean distance: {analysis['within_mean']:.4f}")
print(f"  Std deviation: {analysis['within_std']:.4f}")
print(f"\nBetween-group (different voicing):")
print(f"  Mean distance: {analysis['between_mean']:.4f}")
print(f"  Std deviation: {analysis['between_std']:.4f}")
print(f"\nSeparation ratio: {analysis['separation_ratio']:.2f}")
print(f"(Higher ratio = better group separation)")

print("\n" + "="*60 + "\n")

# Test with place of articulation grouping
place_groups = ['labial', 'labial', 'coronal', 'coronal', 'dorsal', 'dorsal']
place_analysis = analyze_phonetic_groups(stops, place_groups, method='hamming')

print("Place-based grouping analysis (stops):")
print("=" * 42)
print(f"Within-group: {place_analysis['within_mean']:.4f} ± {place_analysis['within_std']:.4f}")
print(f"Between-group: {place_analysis['between_mean']:.4f} ± {place_analysis['between_std']:.4f}")
print(f"Separation ratio: {place_analysis['separation_ratio']:.2f}")

print("\nComparison:")
print(f"Voicing separation: {analysis['separation_ratio']:.2f}")
print(f"Place separation:   {place_analysis['separation_ratio']:.2f}")
better = "Voicing" if analysis['separation_ratio'] > place_analysis['separation_ratio'] else "Place"
print(f"Better grouping principle: {better}")

### Distance Matrix Visualization and Analysis

Helpers for summarizing a distance matrix and comparing distance methods on the same phoneme set:

In [None]:
# Advanced: Distance matrix analysis
import numpy as np
from distfeat import build_distance_matrix

def matrix_statistics(matrix, phonemes):
    """Calculate comprehensive statistics for a distance matrix."""
    # Get upper triangle (unique distances)
    upper_tri = matrix[np.triu_indices_from(matrix, k=1)]
    
    # Basic statistics
    stats = {
        'size': matrix.shape[0],
        'total_distances': len(upper_tri),
        'min_distance': np.min(upper_tri),
        'max_distance': np.max(upper_tri),
        'mean_distance': np.mean(upper_tri),
        'median_distance': np.median(upper_tri),
        'std_distance': np.std(upper_tri),
    }
    
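    # Find most similar and dissimilar pairs. The zero diagonal is masked with
    # inf so a phoneme is never reported as most similar to itself.
    off_diagonal = matrix.astype(float)  # astype returns a copy
    np.fill_diagonal(off_diagonal, np.inf)
    min_idx = np.unravel_index(np.argmin(off_diagonal), matrix.shape)
    max_idx = np.unravel_index(np.argmax(matrix), matrix.shape)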
    
    stats['most_similar'] = (phonemes[min_idx[0]], phonemes[min_idx[1]], matrix[min_idx])
    stats['most_dissimilar'] = (phonemes[max_idx[0]], phonemes[max_idx[1]], matrix[max_idx])
    
    # Distance distribution
    hist, bins = np.histogram(upper_tri, bins=10)
    stats['distribution'] = list(zip(bins[:-1], bins[1:], hist))
    
    return stats

def compare_distance_methods(phonemes, methods=('hamming', 'jaccard', 'euclidean')):
    """Compare different distance methods on the same phoneme set."""
    results = {}
    
    for method in methods:
        matrix, phoneme_list = build_distance_matrix(phonemes, method=method)
        stats = matrix_statistics(matrix, phoneme_list)
        results[method] = stats
    
    return results

# Test with consonants
consonants = ['p', 'b', 't', 'd', 'k', 'g', 'm', 'n', 'ŋ', 'f', 'v', 's', 'z']
comparison = compare_distance_methods(consonants)

print("Distance method comparison (consonants):")
print("=" * 60)
print(f"{'Method':<12} {'Mean':<8} {'Std':<8} {'Min':<8} {'Max':<8}")
print("-" * 60)

for method, stats in comparison.items():
    print(f"{method:<12} {stats['mean_distance']:<8.3f} {stats['std_distance']:<8.3f} "
          f"{stats['min_distance']:<8.3f} {stats['max_distance']:<8.3f}")

print("\n" + "="*60 + "\n")

# Most similar/dissimilar pairs by method
print("Most similar and dissimilar pairs:")
print("-" * 45)

for method, stats in comparison.items():
    similar = stats['most_similar']
    dissimilar = stats['most_dissimilar']
    
    print(f"\n{method.upper()}:")
    print(f"  Most similar:    {similar[0]}-{similar[1]} ({similar[2]:.3f})")
    print(f"  Most dissimilar: {dissimilar[0]}-{dissimilar[1]} ({dissimilar[2]:.3f})")

# Matrix correlations
print("\n" + "="*60 + "\n")
print("Method correlations:")
print("-" * 30)

methods = list(comparison.keys())
matrices = {}
for method in methods:
    matrix, _ = build_distance_matrix(consonants, method=method)
    matrices[method] = matrix[np.triu_indices_from(matrix, k=1)]

for i, method1 in enumerate(methods):
    for method2 in methods[i+1:]:
        correlation = np.corrcoef(matrices[method1], matrices[method2])[0, 1]
        print(f"{method1} - {method2}: {correlation:.3f}")

### Performance Optimization

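The cell below benchmarks matrix construction per distance method and measures the effect of the LRU cache on repeated `calculate_distance()` calls: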

In [None]:
# Performance optimization examples
import time
import numpy as np
from distfeat import calculate_distance, build_distance_matrix

def benchmark_methods(phonemes, methods=('hamming', 'euclidean', 'cosine'), iterations=3):
    """Benchmark different distance methods."""
    results = {}
    
    for method in methods:
        times = []
        
        for _ in range(iterations):
            start = time.time()
            build_distance_matrix(phonemes, method=method)
            times.append(time.time() - start)
        
        results[method] = {
            'mean_time': np.mean(times),
            'std_time': np.std(times),
            'min_time': np.min(times),
            'max_time': np.max(times)
        }
    
    return results

def cache_performance_test():
    """Test LRU cache performance."""
    # Clear cache
    calculate_distance.cache_clear()
    
    test_pairs = [('p', 'b'), ('t', 'd'), ('k', 'g')] * 100
    
    # First run (cold cache: only the first call for each of the 3 unique pairs misses)
    start = time.time()
    for p1, p2 in test_pairs:
        calculate_distance(p1, p2)
    first_run = time.time() - start
    
    # Second run (cache hits)
    start = time.time()
    for p1, p2 in test_pairs:
        calculate_distance(p1, p2)
    second_run = time.time() - start
    
    cache_info = calculate_distance.cache_info()
    
    return {
        'first_run': first_run,
        'second_run': second_run,
        'speedup': first_run / second_run,
        'cache_info': cache_info
    }

# Benchmark different methods
test_phonemes = ['p', 'b', 't', 'd', 'k', 'g', 'm', 'n', 'f', 'v']
bench_results = benchmark_methods(test_phonemes)

print(f"Performance benchmark ({len(test_phonemes)} phonemes):")
print("=" * 55)
print(f"{'Method':<12} {'Mean (ms)':<12} {'Std (ms)':<12} {'Min (ms)':<12}")
print("-" * 55)

for method, stats in bench_results.items():
    print(f"{method:<12} {stats['mean_time']*1000:<12.2f} "
          f"{stats['std_time']*1000:<12.2f} {stats['min_time']*1000:<12.2f}")

# Find fastest method
fastest = min(bench_results.items(), key=lambda x: x[1]['mean_time'])
print(f"\nFastest method: {fastest[0]} ({fastest[1]['mean_time']*1000:.2f} ms)")

print("\n" + "="*60 + "\n")

# Cache performance
cache_results = cache_performance_test()

print("Cache performance test:")
print("-" * 30)
print(f"First run (cache misses):  {cache_results['first_run']*1000:.2f} ms")
print(f"Second run (cache hits):   {cache_results['second_run']*1000:.2f} ms")
print(f"Cache speedup:             {cache_results['speedup']:.1f}x")
print(f"\nCache statistics:")
info = cache_results['cache_info']
print(f"  Hits: {info.hits}, Misses: {info.misses}")
print(f"  Hit rate: {info.hits / (info.hits + info.misses) * 100:.1f}%")
print(f"  Cache size: {info.currsize}/{info.maxsize}")

## Distance Method Details

### Mathematical Formulations

Each distance method implements a specific mathematical formula:

**Hamming Distance:**
```
hamming(x, y) = Σᵢ (xᵢ ≠ yᵢ) / n
```
Counts the fraction of positions where binary features differ.

**Jaccard Distance:**
```
jaccard(x, y) = 1 - |A ∩ B| / |A ∪ B|
```
Where A and B are sets of positive features in x and y.

**Euclidean Distance:**
```
euclidean(x, y) = √(Σᵢ (xᵢ - yᵢ)²) / √n
```
L2 norm normalized by dimension.

**Cosine Distance:**
```
cosine(x, y) = 1 - (x·y) / (||x|| ||y||)
```
Angular distance between feature vectors.

**Manhattan Distance:**
```
manhattan(x, y) = Σᵢ |xᵢ - yᵢ| / n
```
L1 norm normalized by dimension.

**K-means Distance:**
```
kmeans(x, y) = ||c(x) - c(y)|| / max_dist
```
Where c(x) is the centroid of the cluster containing phoneme x.
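
The vector-based formulas above can be checked on a pair of toy binary vectors with plain NumPy. The vectors `x` and `y` are made up for illustration; k-means is left out because it depends on a fitted clustering.

```python
import numpy as np

# Two made-up binary feature vectors with n = 6 features
x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
y = np.array([1, 1, 0, 1, 0, 0], dtype=float)
n = len(x)

# Hamming: fraction of positions that differ
hamming = np.sum(x != y) / n

# Jaccard: 1 - |A ∩ B| / |A ∪ B| over the positive features
pos_x, pos_y = x == 1, y == 1
jaccard = 1 - np.sum(pos_x & pos_y) / np.sum(pos_x | pos_y)

# Euclidean: L2 norm of the difference, divided by sqrt(n)
euclidean = np.sqrt(np.sum((x - y) ** 2)) / np.sqrt(n)

# Cosine: 1 - cosine similarity
cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Manhattan: L1 norm of the difference, divided by n
manhattan = np.sum(np.abs(x - y)) / n

print(f"hamming={hamming:.4f} jaccard={jaccard:.4f} euclidean={euclidean:.4f} "
      f"cosine={cosine:.4f} manhattan={manhattan:.4f}")
```

For strictly 0/1 vectors, the normalized Hamming and Manhattan values coincide, and the normalized Euclidean value is the square root of the Hamming value, so these three methods rank pairs identically on binary features.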

In [None]:
# Mathematical verification of distance methods
from distfeat import calculate_distance, phoneme_to_features, get_feature_names
import numpy as np

def manual_hamming(p1, p2):
    """Manual Hamming distance calculation for verification."""
    f1 = phoneme_to_features(p1)
    f2 = phoneme_to_features(p2)
    if f1 is None or f2 is None:
        return None
    
    feature_names = get_feature_names()
    differences = sum(1 for f in feature_names if f1.get(f, 0) != f2.get(f, 0))
    return differences / len(feature_names)

def manual_jaccard(p1, p2):
    """Manual Jaccard distance calculation for verification."""
    f1 = phoneme_to_features(p1)
    f2 = phoneme_to_features(p2)
    if f1 is None or f2 is None:
        return None
    
    # Get positive features
    pos1 = set(f for f, v in f1.items() if v == 1)
    pos2 = set(f for f, v in f2.items() if v == 1)
    
    intersection = len(pos1 & pos2)
    union = len(pos1 | pos2)
    
    if union == 0:
        return 0.0
    
    return 1.0 - (intersection / union)

# Verification tests
test_pairs = [('p', 'b'), ('p', 't'), ('a', 'i')]

print("Mathematical verification:")
print("=" * 50)
print(f"{'Pair':<6} {'Method':<8} {'Library':<8} {'Manual':<8} {'Match':<8}")
print("-" * 50)

for p1, p2 in test_pairs:
    # Hamming
    lib_hamming = calculate_distance(p1, p2, method='hamming')
    man_hamming = manual_hamming(p1, p2)
    
    if lib_hamming is not None and man_hamming is not None:
        hamming_match = abs(lib_hamming - man_hamming) < 1e-6
        print(f"{p1}-{p2:<3} {'hamming':<8} {lib_hamming:<8.4f} {man_hamming:<8.4f} {hamming_match!s:<8}")
    
    # Jaccard
    lib_jaccard = calculate_distance(p1, p2, method='jaccard')
    man_jaccard = manual_jaccard(p1, p2)
    
    if lib_jaccard is not None and man_jaccard is not None:
        jaccard_match = abs(lib_jaccard - man_jaccard) < 1e-6
        print(f"{' ':<6} {'jaccard':<8} {lib_jaccard:<8.4f} {man_jaccard:<8.4f} {jaccard_match!s:<8}")

print("\n" + "="*60 + "\n")

# Distance properties verification
print("Distance metric properties:")
print("-" * 40)

methods = ['hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan']
test_phonemes = ['p', 'b', 't']

for method in methods:
    print(f"\n{method.upper()}:")
    
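    # Identity: d(x,x) = 0 (compared with a tolerance, since floating-point
    # rounding can leave a tiny residue, e.g. for cosine)
    identity = all(abs(calculate_distance(p, p, method=method)) < 1e-9 for p in test_phonemes)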
    print(f"  Identity: {identity}")
    
    # Symmetry: d(x,y) = d(y,x)
    symmetry = all(
        abs(calculate_distance(p1, p2, method=method) - 
            calculate_distance(p2, p1, method=method)) < 1e-6
        for p1 in test_phonemes for p2 in test_phonemes
    )
    print(f"  Symmetry: {symmetry}")
    
    # Non-negativity: d(x,y) >= 0
    non_negative = all(
        calculate_distance(p1, p2, method=method) >= 0
        for p1 in test_phonemes for p2 in test_phonemes
    )
    print(f"  Non-negative: {non_negative}")

## Error Handling and Edge Cases

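The `on_error` parameter controls how unknown phonemes are reported. The cell below exercises each mode, along with empty, single-element, and mixed inputs: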

In [None]:
# Error handling examples
from distfeat import calculate_distance, build_distance_matrix
import warnings

print("Error handling scenarios:")
print("=" * 40)

# 1. Unknown phonemes
print("\n1. Unknown phonemes:")
for error_mode in ['warn', 'ignore', 'raise']:
    try:
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            result = calculate_distance('p', 'xyz', on_error=error_mode)
            warn_count = len(w)
        print(f"   {error_mode:<6}: result={result}, warnings={warn_count}")
    except Exception as e:
        print(f"   {error_mode:<6}: {type(e).__name__} - {str(e)[:40]}...")

# 2. Unknown distance method
print("\n2. Unknown distance method:")
try:
    calculate_distance('p', 'b', method='nonexistent')
except ValueError as e:
    print(f"   ValueError: {e}")

# 3. Empty phoneme list
print("\n3. Empty phoneme list:")
matrix, phonemes = build_distance_matrix([])
print(f"   Matrix shape: {matrix.shape}")
print(f"   Phoneme list: {phonemes}")

# 4. Single phoneme
print("\n4. Single phoneme:")
matrix, phonemes = build_distance_matrix(['p'])
print(f"   Matrix shape: {matrix.shape}")
print(f"   Matrix content: {matrix[0,0]}")

# 5. Mixed valid/invalid phonemes
print("\n5. Mixed valid/invalid phonemes:")
mixed_phonemes = ['p', 'xyz', 'b', 'abc']
matrix, valid_phonemes = build_distance_matrix(mixed_phonemes)
print(f"   Input: {mixed_phonemes}")
print(f"   Output phonemes: {valid_phonemes}")
print(f"   Matrix shape: {matrix.shape}")

print("\n" + "="*60 + "\n")

# Edge cases for specific distance methods
print("Distance method edge cases:")
print("-" * 35)

# Identical phonemes
print("\n1. Identical phonemes (should be 0.0):")
for method in ['hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan']:
    dist = calculate_distance('p', 'p', method=method)
    print(f"   {method:<10}: {dist}")

# Very different phonemes
print("\n2. Very different phonemes (consonant vs vowel):")
for method in ['hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan']:
    dist = calculate_distance('p', 'a', method=method)
    if dist is not None:
        print(f"   {method:<10}: {dist:.4f}")

# Performance with large matrices
print("\n3. Large matrix handling:")
import time

# Test with 20 phonemes (190 unique distances)
large_set = ['p', 'b', 't', 'd', 'k', 'g', 'm', 'n', 'ŋ', 'f', 
             'v', 's', 'z', 'ʃ', 'ʒ', 'x', 'ɣ', 'h', 'l', 'r']

start = time.perf_counter()
matrix, phonemes = build_distance_matrix(large_set, method='hamming')
elapsed = time.perf_counter() - start

unique_distances = len(large_set) * (len(large_set) - 1) // 2
print(f"   {len(large_set)} phonemes = {unique_distances} unique distances")
print(f"   Computation time: {elapsed:.4f} seconds")
if elapsed > 0:  # guard against a zero reading on very fast runs
    print(f"   Rate: {unique_distances/elapsed:.0f} distances/second")

## Integration Examples

### Working with Features Module

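Distance matrices can be combined with `phoneme_to_features()` from the features module, for example to cluster phonemes and then inspect the features each cluster shares: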

In [None]:
# Integration with features module
from distfeat import phoneme_to_features, calculate_distance, build_distance_matrix
import numpy as np

def feature_based_clustering(phonemes, n_clusters=3, method='hamming'):
    """Cluster phonemes based on distances."""
    matrix, phoneme_list = build_distance_matrix(phonemes, method=method)
    
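    # Greedy agglomerative merging: repeatedly merge the closest remaining pair.
    # Distances are not recomputed after a merge, so each cluster keeps the
    # distances of its first phoneme.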
    clusters = {i: [phoneme] for i, phoneme in enumerate(phoneme_list)}
    distances = matrix.copy()
    np.fill_diagonal(distances, np.inf)  # Ignore self-distances
    
    while len(clusters) > n_clusters:
        # Find closest pair
        min_idx = np.unravel_index(np.argmin(distances), distances.shape)
        i, j = min_idx
        
        # Merge clusters
        if i in clusters and j in clusters:
            clusters[i].extend(clusters[j])
            del clusters[j]
            
            # Update distances (set merged phonemes to infinite distance)
            distances[j, :] = np.inf
            distances[:, j] = np.inf
    
    return list(clusters.values())

# Test clustering
test_consonants = ['p', 'b', 't', 'd', 'k', 'g', 'm', 'n', 'ŋ']
clusters = feature_based_clustering(test_consonants, n_clusters=3)

print("Phoneme clustering based on Hamming distances:")
print("=" * 50)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster}")

print("\n" + "="*60 + "\n")

# Feature analysis of clusters
print("Feature analysis of clusters:")
print("-" * 35)

for i, cluster in enumerate(clusters, 1):
    print(f"\nCluster {i} ({', '.join(cluster)}):")
    
    # Get features for all phonemes in cluster
    all_features = [phoneme_to_features(p) for p in cluster]
    all_features = [f for f in all_features if f is not None]
    
    if all_features:
        # Find shared features (present in all phonemes)
        shared_features = set(all_features[0].keys())
        for features in all_features[1:]:
            shared_features &= set(features.keys())
        
        # Find common positive features
        common_positive = []
        for feature in shared_features:
            if all(f[feature] == 1 for f in all_features):
                common_positive.append(feature)
        
        print(f"   Common positive features ({len(common_positive)}):")
        if common_positive:
            # Show first 5 features
            shown = common_positive[:5]
            print(f"     {', '.join(shown)}")
            if len(common_positive) > 5:
                print(f"     ... and {len(common_positive) - 5} more")
        else:
            print("     None")

## Implementation Notes

### Caching Strategy

The distances module uses multiple levels of caching:

- **Function-level caching**: `calculate_distance()` uses `@lru_cache(maxsize=4096)`
- **Feature caching**: Features are cached by the features module
- **Matrix caching**: `build_distance_matrix()` can optionally use cached calculations

### Memory Usage

- Distance matrices are stored as `float64` numpy arrays
- For N phonemes, matrix requires 8*N² bytes
- Large matrices (N>1000) may require significant memory
- Consider computing distances on-demand for very large sets
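
The 8*N² figure can be turned into concrete numbers. The sketch below also shows the size of a condensed upper triangle, which suffices because the matrix is symmetric with a zero diagonal. Both helper names are hypothetical.

```python
import numpy as np

def full_matrix_bytes(n_phonemes: int) -> int:
    """Bytes for a dense float64 matrix of shape (n, n)."""
    return 8 * n_phonemes ** 2

def condensed_bytes(n_phonemes: int) -> int:
    """Bytes for only the n(n-1)/2 unique off-diagonal distances."""
    return 8 * (n_phonemes * (n_phonemes - 1) // 2)

# Cross-check the formula against NumPy's own accounting
sample = np.zeros((50, 50), dtype=np.float64)
print(sample.nbytes == full_matrix_bytes(50))

for n in (100, 1000, 10000):
    print(f"N={n:>5}: full={full_matrix_bytes(n) / 1e6:8.2f} MB, "
          f"condensed={condensed_bytes(n) / 1e6:8.2f} MB")
```

At N = 1000 the dense matrix takes 8 MB, and at N = 10000 it takes 800 MB, which is where on-demand computation starts to pay off.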

### Numerical Stability

- All methods handle edge cases (zero vectors, identical vectors)
- Cosine distance includes special handling for floating-point precision
- Normalization prevents overflow in large feature spaces
- K-means clustering uses fixed random seed for reproducibility
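
As an illustration of the kind of edge-case handling described above, here is one way a cosine distance can be guarded. This is a sketch, not the library's code: `safe_cosine_distance` is a hypothetical name, and the values returned for zero vectors are assumptions.

```python
import numpy as np

def safe_cosine_distance(x, y) -> float:
    """Cosine distance with guards for zero vectors and rounding error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)
    if norm_x == 0.0 and norm_y == 0.0:
        return 0.0  # assumption: two zero vectors count as identical
    if norm_x == 0.0 or norm_y == 0.0:
        return 1.0  # assumption: a zero vector is maximally distant
    similarity = np.dot(x, y) / (norm_x * norm_y)
    # Rounding can push the ratio slightly outside [-1, 1]; clip it back
    similarity = float(np.clip(similarity, -1.0, 1.0))
    return 1.0 - similarity

v = [1, 0, 1, 1, 0, 1, 1]
print(safe_cosine_distance(v, v))            # at or extremely close to 0
print(safe_cosine_distance([0, 0], [0, 0]))  # zero-vector guard
print(safe_cosine_distance([1, 0], [0, 1]))  # orthogonal vectors
```

The clip ensures identical vectors never produce a negative distance, which is also why identity checks on cosine are better done with a tolerance than with exact equality.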

### Extension Points

The module provides several ways to extend functionality:

1. **Custom distance methods** via `register_distance_method()`
2. **Custom normalization** by modifying normalization parameters
3. **Alternative clustering** by replacing k-means with other algorithms
4. **Parallel computation** by batching matrix calculations
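
For the batching idea in item 4, one possible approach is to compute whole row blocks at once with NumPy broadcasting. The sketch below assumes a 0/1 feature array and a normalized Hamming distance; `pairwise_hamming` and its `chunk_size` parameter are hypothetical, and each chunk could in principle be handed to a separate worker.

```python
import numpy as np

def pairwise_hamming(features: np.ndarray, chunk_size: int = 2) -> np.ndarray:
    """Normalized Hamming distance matrix, computed one block of rows at a time."""
    n = features.shape[0]
    out = np.empty((n, n), dtype=float)
    for start in range(0, n, chunk_size):
        block = features[start:start + chunk_size]
        # Compare every row in the block with every row overall:
        # (block_rows, 1, k) against (1, n, k) gives (block_rows, n, k)
        differs = block[:, None, :] != features[None, :, :]
        out[start:start + chunk_size] = differs.mean(axis=2)
    return out

# Three made-up feature vectors with four binary features each
features = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
batched = pairwise_hamming(features, chunk_size=2)
print(batched)
```

Smaller chunks bound the size of the intermediate boolean array, at the cost of more loop iterations.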

## See Also

- **[Features Module API](features.ipynb)**: Core phonetic feature extraction
- **[Implementation Details](../chapters/03_implementation.ipynb)**: Internal architecture and algorithms
- **[Validation](../chapters/04_validation.ipynb)**: Testing and validation of distance methods
- **[User Guide](../chapters/01_quickstart.ipynb)**: Getting started with distfeat
