# Chapter 1: Introduction

## The Challenge of Phonetic Distance

Historical linguists have long recognized that sound changes follow regular patterns. When Latin *pater* became Spanish *padre* and French *père*, the initial [p] remained unchanged. When Latin *caput* became Spanish *cabo* and French *chef*, however, the changes were more dramatic: the medial [p] voiced to [b] in Spanish, while in French the initial [k] became [ʃ] and the [p] ended up as [f]. Understanding such changes requires a way to measure how "different" two sounds are from each other.

Consider these cognate sets from Indo-European languages:

In [None]:
# Example cognate sets showing sound correspondences
cognates = {
    'FATHER': {
        'English': ['f', 'ɑː', 'ð', 'ər'],
        'German': ['f', 'aː', 't', 'ər'],
        'Latin': ['p', 'a', 't', 'er'],
        'Sanskrit': ['p', 'i', 't', 'ar']
    },
    'FOOT': {
        'English': ['f', 'ʊ', 't'],
        'German': ['f', 'uː', 's'],
        'Latin': ['p', 'e', 'd'],   # stem ped- (nominative pēs)
        'Greek': ['p', 'o', 'd']    # stem pod- (nominative poús)
    }
}

# Display the cognates
for gloss, words in cognates.items():
    print(f"\n{gloss}:")
    for lang, segments in words.items():
        print(f"  {lang:10} {''.join(segments)}")

## Why Phonetic Features Matter

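The pattern above illustrates Grimm's Law: Proto-Indo-European \*p became Germanic f, which is why English and German show [f] where Latin, Sanskrit, and Greek keep [p]. But why is this change "natural"? The answer lies in phonetic features, which show how much the two sounds have in common: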

In [None]:
from distfeat import phoneme_to_features, calculate_distance

# Compare features of p and f
p_features = phoneme_to_features('p')
f_features = phoneme_to_features('f')

# Find common and different features
common_features = []
different_features = []

for feature in p_features:
    if p_features[feature] == f_features[feature]:
        if p_features[feature] == 1:  # Only show active features
            common_features.append(feature)
    else:
        different_features.append(f"{feature}: p={p_features[feature]}, f={f_features[feature]}")

print("Common features (both sounds share):")
for feat in common_features[:5]:  # Show first 5
    print(f"  + {feat}")

print("\nDifferent features:")
for feat in different_features[:5]:  # Show first 5
    print(f"  - {feat}")

# Calculate distance
distance = calculate_distance('p', 'f', method='hamming')
print(f"\nHamming distance: {distance:.3f}")
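
The same comparison can be sketched without distfeat to show what a normalized Hamming distance measures. The feature values below are a small hand-picked subset chosen for illustration (an assumption, not distfeat's actual feature table), and `hamming_distance` is a hypothetical helper name.

```python
# Illustrative feature values only: a hand-picked subset,
# not the feature table that distfeat actually uses.
P = {'consonantal': 1, 'sonorant': 0, 'voice': 0,
     'labial': 1, 'continuant': 0, 'nasal': 0}
F = {'consonantal': 1, 'sonorant': 0, 'voice': 0,
     'labial': 1, 'continuant': 1, 'nasal': 0}


def hamming_distance(a, b):
    """Return (share of differing features, names of the differing features)."""
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("no features in common")
    mismatches = [name for name in shared if a[name] != b[name]]
    return len(mismatches) / len(shared), mismatches


distance, mismatches = hamming_distance(P, F)
print(f"Differing features: {', '.join(mismatches)}")
print(f"Normalized Hamming distance: {distance:.3f}")
```

In this toy encoding the stop and the fricative differ in a single feature, which is one way to read the claim that \*p > f is a small step.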

## The Need for a Unified Framework

Despite the importance of phonetic distance in historical linguistics, existing tools have limitations:

### Current Approaches and Their Limitations

1. **Ad-hoc Distance Matrices**: Many studies use manually crafted distance matrices based on linguistic intuition
   - Not reproducible
   - Not extensible to new sounds
   - Difficult to validate

2. **Simple Edit Distance**: Treats all substitutions equally
   - [p] → [b] costs the same as [p] → [a]
   - Ignores phonetic similarity

3. **Sound Classes** (Dolgopolsky, ASJP): Groups similar sounds
   - Coarse-grained (typically 10-40 classes)
   - Loss of phonetic detail
   - Arbitrary boundaries
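
The second and third limitations are easy to reproduce in a few lines. The sketch below pairs a plain unit-cost Levenshtein distance with a toy sound-class table. The names `edit_distance`, `to_classes`, and `TOY_CLASSES` are hypothetical, and the class table is a simplified illustration loosely modelled on Dolgopolsky-style classes rather than the published scheme.

```python
def edit_distance(source, target):
    """Levenshtein distance over segment lists; every edit costs 1."""
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            substitution = prev[j - 1] + (0 if s == t else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


# Toy sound classes (assumption): labial obstruents, coronal stops, vowels.
TOY_CLASSES = {'p': 'P', 'b': 'P', 'f': 'P', 't': 'T', 'd': 'T', 'a': 'V'}


def to_classes(segments):
    """Replace each segment with its coarse class symbol."""
    return [TOY_CLASSES[s] for s in segments]


base = ['p', 'a', 't']
voiced = ['b', 'a', 't']   # [p] -> [b]: only voicing changes
vowel = ['a', 'a', 't']    # [p] -> [a]: a consonant replaced by a vowel

# Unit costs cannot tell the two substitutions apart.
print(edit_distance(base, voiced), edit_distance(base, vowel))

# Sound classes erase the difference between distinct words entirely.
print(to_classes(['p', 'a', 't']) == to_classes(['f', 'a', 'd']))
```

Both substitutions receive the same cost under unit edits, while the class encoding makes two different segment strings indistinguishable. A feature-based distance sits between these two extremes.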

### Our Solution: distfeat

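distfeat takes a principled, reproducible approach: distances are computed from explicit feature representations rather than from hand-tuned tables, so the same inputs yield the same matrix every time.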

In [None]:
from distfeat import build_distance_matrix, calculate_distance
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Build distance matrix for stops
stops = ['p', 'b', 't', 'd', 'k', 'g']
matrix, labels = build_distance_matrix(stops, method='hamming')

# Visualize the matrix
plt.figure(figsize=(8, 6))
sns.heatmap(matrix, 
            xticklabels=labels, 
            yticklabels=labels,
            annot=True, 
            fmt='.3f',
            cmap='YlOrRd',
            vmin=0, vmax=1)
plt.title('Phonetic Distance Matrix for Stops')
plt.tight_layout()
plt.show()

# Analyze the pattern (same metric as the matrix above)
print("Observations:")
print("1. Voiceless-voiced pairs have small distances:")
for voiceless, voiced in [('p', 'b'), ('t', 'd'), ('k', 'g')]:
    dist = calculate_distance(voiceless, voiced, method='hamming')
    print(f"   {voiceless}-{voiced}: {dist:.3f}")

print("\n2. Different places of articulation have larger distances:")
for first, second in [('p', 't'), ('p', 'k'), ('t', 'k')]:
    dist = calculate_distance(first, second, method='hamming')
    print(f"   {first}-{second}: {dist:.3f}")
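
To see why voicing pairs come out closer than place pairs, it can help to build such a matrix by hand. The following is a minimal sketch under a deliberately tiny encoding: four features per stop, with values assumed for illustration rather than taken from distfeat. `TOY_STOPS` and `normalized_hamming` are hypothetical names.

```python
# Toy encoding (assumption), in the order: voice, labial, coronal, dorsal.
TOY_STOPS = {
    'p': (0, 1, 0, 0),
    'b': (1, 1, 0, 0),
    't': (0, 0, 1, 0),
    'd': (1, 0, 1, 0),
    'k': (0, 0, 0, 1),
    'g': (1, 0, 0, 1),
}


def normalized_hamming(u, v):
    """Share of feature positions whose values differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


labels = list(TOY_STOPS)
matrix = [[normalized_hamming(TOY_STOPS[r], TOY_STOPS[c]) for c in labels]
          for r in labels]

# Print the matrix as a small text table.
print("   " + " ".join(f"{label:>5}" for label in labels))
for label, row in zip(labels, matrix):
    print(f"{label:>2} " + " ".join(f"{value:5.2f}" for value in row))
```

A voicing contrast flips one feature, whereas a change of place flips two (one place feature off, another on), so place pairs come out twice as distant in this encoding. A richer feature set changes the numbers but follows the same logic.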

## Core Principles

### 1. Feature-Based Representation

Every phoneme is represented as a vector of binary features based on articulatory and acoustic properties:

In [None]:
from distfeat import get_feature_names

# Get all feature names
all_features = get_feature_names()
print(f"Total features: {len(all_features)}")

# Categorize features
categories = {
    'Major class': ['consonantal', 'sonorant', 'syllabic'],
    'Manner': ['continuant', 'nasal', 'strident', 'lateral', 'delayedrelease'],
    'Place': ['labial', 'coronal', 'dorsal', 'pharyngeal', 'glottal'],
    'Voicing': ['voice', 'spreadglottis', 'constrictedglottis'],
    'Secondary': ['round', 'anterior', 'distributed', 'high', 'low', 'back']
}

for category, features in categories.items():
    available = [f for f in features if f in all_features]
    print(f"\n{category} features: {', '.join(available)}")

### 2. Multiple Distance Metrics

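Different linguistic questions call for different distance metrics. The comparison below runs five of them on the same phoneme pairs, ranging from a voicing contrast ([p]-[b]) to a consonant-vowel contrast ([p]-[a]):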

In [None]:
from distfeat import calculate_distance

# Compare different metrics
methods = ['hamming', 'jaccard', 'euclidean', 'cosine', 'manhattan']
phoneme_pairs = [('p', 'b'), ('p', 'f'), ('p', 'k'), ('p', 'a')]

print("Distance metrics comparison:\n")
print("Pair    ", end="")
for method in methods:
    print(f"{method:>10}", end="")
print()
print("-" * (8 + 10 * len(methods)))  # match the width of the header row

for p1, p2 in phoneme_pairs:
    print(f"{p1}-{p2}     ", end="")
    for method in methods:
        dist = calculate_distance(p1, p2, method=method)
        if dist is not None:
            print(f"{dist:10.3f}", end="")
        else:
            print(f"{'N/A':>10}", end="")
    print()
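
For reference, the five metrics can be written out on plain binary vectors. This is a sketch of the textbook definitions only; distfeat may normalize or scale its results differently, and the toy vectors and function names below are assumptions made for illustration.

```python
import math

# Toy binary vectors (illustrative assumption), in the order:
# consonantal, voice, continuant, labial, dorsal.
TOY_VECTORS = {
    'p': (1, 0, 0, 1, 0),
    'b': (1, 1, 0, 1, 0),
    'k': (1, 0, 0, 0, 1),
}


def hamming(u, v):
    """Share of positions whose values differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def manhattan(u, v):
    """Sum of absolute differences (a raw mismatch count for 0/1 data)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard(u, v):
    """1 - |features active in both| / |features active in either|."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1 - both / either if either else 0.0


def cosine(u, v):
    """1 - cosine of the angle between the vectors; None if one is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm if norm else None


METRICS = {'hamming': hamming, 'jaccard': jaccard, 'euclidean': euclidean,
           'cosine': cosine, 'manhattan': manhattan}

for first, second in [('p', 'b'), ('p', 'k')]:
    u, v = TOY_VECTORS[first], TOY_VECTORS[second]
    values = "  ".join(f"{name}={fn(u, v):.3f}" for name, fn in METRICS.items())
    print(f"{first}-{second}: {values}")
```

Jaccard and cosine ignore features that are inactive in both sounds, whereas Hamming counts them as agreement. The metrics can therefore rank the same pairs differently once the feature inventory grows.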

### 3. Validation Against Cognate Data

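A key innovation is using cognate data to validate and optimize distance metrics. The underlying expectation is simple: a useful metric should place cognate words closer together than unrelated words.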

In [None]:
from distfeat.alignment import align_sequences

# Example: Germanic cognates for "water"
water_cognates = [
    ['w', 'ɔː', 't', 'ər'],  # English water
    ['v', 'a', 's', 'ər'],   # German Wasser
    ['w', 'aː', 't', 'ər'],  # Dutch water
]

# Align first two cognates
result = align_sequences(water_cognates[0], water_cognates[1])
print("Alignment of English 'water' and German 'Wasser':")
print(f"English: {' '.join(result.seq1_aligned)}")
print(f"German:  {' '.join(result.seq2_aligned)}")
print(f"Distance: {result.normalized_distance:.3f}")

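# Compare with a non-cognate
non_cognate = ['m', 'ɛ', 'ʁ']  # French "mer" (sea)
result2 = align_sequences(water_cognates[0], non_cognate)
print(f"\nDistance to non-cognate: {result2.normalized_distance:.3f}")

# Check the expectation instead of assuming it
if result.normalized_distance < result2.normalized_distance:
    print("\nThe cognate pair has the lower distance, as expected ✓")
else:
    print("\nUnexpected: the non-cognate pair is at least as close ✗")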
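
The idea behind feature-weighted alignment can be sketched independently of distfeat. The code below is a minimal Needleman-Wunsch style global alignment in which substitutions cost the normalized Hamming distance between toy feature vectors and gaps cost a fixed amount. Everything here is an assumption made for illustration: the feature values, the gap cost of 1.0, the normalization by alignment length, and the names `align` and `substitution_cost`. How `align_sequences` computes `normalized_distance` internally is not shown in this chapter.

```python
GAP_COST = 1.0  # assumed cost of inserting or deleting one segment

# Toy feature vectors (illustrative assumption, not distfeat's table), in the
# order: syllabic, sonorant, voice, continuant, labial, coronal.
TOY_FEATURES = {
    'w':  (0, 1, 1, 1, 1, 0),
    'v':  (0, 0, 1, 1, 1, 0),
    'm':  (0, 1, 1, 0, 1, 0),
    't':  (0, 0, 0, 0, 0, 1),
    's':  (0, 0, 0, 1, 0, 1),
    'ʁ':  (0, 0, 1, 1, 0, 0),
    'a':  (1, 1, 1, 1, 0, 0),
    'ɛ':  (1, 1, 1, 1, 0, 0),
    'ɔː': (1, 1, 1, 1, 1, 0),
    'ər': (1, 1, 1, 1, 0, 1),
}


def substitution_cost(x, y):
    """Normalized Hamming distance between two toy feature vectors."""
    fx, fy = TOY_FEATURES[x], TOY_FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)


def align(seq1, seq2):
    """Return (aligned seq1, aligned seq2, cost per alignment column)."""
    n, m = len(seq1), len(seq2)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * GAP_COST
    for j in range(1, m + 1):
        cost[0][j] = j * GAP_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_cost(seq1[i - 1], seq2[j - 1]),
                cost[i - 1][j] + GAP_COST,
                cost[i][j - 1] + GAP_COST,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row1, row2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(
            cost[i][j] - cost[i - 1][j - 1]
            - substitution_cost(seq1[i - 1], seq2[j - 1])
        ) < 1e-9:
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and abs(cost[i][j] - cost[i - 1][j] - GAP_COST) < 1e-9:
            row1.append(seq1[i - 1])
            row2.append('-')
            i -= 1
        else:
            row1.append('-')
            row2.append(seq2[j - 1])
            j -= 1
    row1.reverse()
    row2.reverse()
    return row1, row2, cost[n][m] / len(row1)


english = ['w', 'ɔː', 't', 'ər']   # water
german = ['v', 'a', 's', 'ər']     # Wasser
french = ['m', 'ɛ', 'ʁ']           # mer (unrelated word)

for other in (german, french):
    top, bottom, distance = align(english, other)
    print(' '.join(top), '|', ' '.join(bottom), f'| {distance:.3f}')
```

Under these toy values the cognate pair aligns segment by segment with small substitution costs, while the unrelated word needs a gap and more costly substitutions. A systematic validation would repeat the comparison over many cognate and non-cognate pairs.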

## Organization of This Documentation

This documentation is organized to serve both as a user guide and as the foundation for an academic paper:

### Part I: Foundation
- **Chapter 2**: Theoretical Foundation - The linguistic and mathematical basis
- **Chapter 3**: Implementation Details - How distfeat works internally

### Part II: Tutorials
- **By Complexity**: Starting from single distances to full research applications
- **By Use Case**: Practical scenarios from comparing words to analyzing sound changes

### Part III: Validation & Benchmarks
- **Chapter 4**: Experimental validation using cognate data
- **Chapter 5**: Performance benchmarks and optimization
- **Chapter 6**: Comparison with existing methods

### Part IV: Case Studies
Six detailed analyses across different language families:
- Indo-European (Grimm's Law)
- Austronesian (vowel harmony)
- Sino-Tibetan (tone development)
- Semitic (root patterns)
- Bantu (noun classes)
- Romance (lenition)

### Part V: API Reference
Complete documentation of all modules and functions

## Next Steps

- **For practitioners**: Jump to the [Quick Start](../tutorials/00_quickstart.ipynb) tutorial
- **For theorists**: Continue to [Chapter 2: Theoretical Foundation](02_theoretical_foundation.ipynb)
- **For developers**: See the [API Reference](../api/features.ipynb)
- **For researchers**: Explore the [Case Studies](../case_studies/indo_european.ipynb)