# Features Module API Reference

The `distfeat.features` module provides the core functionality for converting between phonemes and phonetic feature vectors. This module handles:

- Loading and caching phonetic feature systems
- Converting phonemes to feature vectors
- Converting feature vectors back to phonemes
- Managing custom feature systems
- Filtering and accessing feature data

## Quick Start

```python
from distfeat import phoneme_to_features, features_to_phoneme, get_feature_system

# Convert phoneme to features
features = phoneme_to_features('p')
print(features['voice'])  # 0 (voiceless)

# Convert features back to phoneme
phoneme = features_to_phoneme({'voice': 0, 'labial': 1})
print(phoneme)  # 'p' or similar voiceless bilabial

# Access full feature system
system = get_feature_system()
print(len(system))  # Number of phonemes
```

## Core Functions

### phoneme_to_features()

Convert a phoneme (IPA symbol) to its phonetic feature vector.

**Signature:**
```python
phoneme_to_features(
    phoneme: str, 
    system: Optional[str] = None,
    on_error: str = 'warn'
) -> Optional[Dict[str, int]]
```

**Parameters:**
- `phoneme` (str): IPA phoneme symbol to convert
- `system` (Optional[str]): Feature system to use (None for default CLTS)
- `on_error` (str): Error handling mode - 'raise', 'warn', or 'ignore'

**Returns:**
- `Dict[str, int]` or `None`: Feature dictionary with binary values (0/1), or None if phoneme not found

**Raises:**
- `ValueError`: If phoneme not found and on_error='raise'
- `ValueError`: If unknown feature system specified
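The three `on_error` modes can be pictured as a small dispatch around a dictionary lookup. The sketch below is not the library's implementation: `_TOY_TABLE` and `lookup` are hypothetical names, and the two-entry table stands in for a loaded feature system.

```python
import warnings
from typing import Dict, Optional

# Hypothetical two-phoneme table standing in for a loaded feature system.
_TOY_TABLE: Dict[str, Dict[str, int]] = {
    "p": {"voice": 0, "labial": 1},
    "b": {"voice": 1, "labial": 1},
}


def lookup(phoneme: str, on_error: str = "warn") -> Optional[Dict[str, int]]:
    """Return a copy of the feature dict, handling misses according to on_error."""
    if on_error not in ("raise", "warn", "ignore"):
        raise ValueError(f"Unknown on_error mode: {on_error!r}")

    features = _TOY_TABLE.get(phoneme)
    if features is not None:
        return dict(features)

    message = f"Phoneme not found: {phoneme!r}"
    if on_error == "raise":
        raise ValueError(message)
    if on_error == "warn":
        warnings.warn(message)
    # Both 'warn' and 'ignore' fall through to a None result.
    return None
```

In this sketch only `'raise'` interrupts control flow; `'warn'` and `'ignore'` differ solely in whether a warning is emitted before `None` is returned.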

```python
# Examples of phoneme_to_features()
from distfeat import phoneme_to_features

# Basic usage - voiceless bilabial stop
p_features = phoneme_to_features('p')
print("Features for /p/:")
for feature, value in sorted(p_features.items()):
    if value == 1:  # Only show positive features
        print(f"  {feature}: {value}")

print("\n" + "="*50 + "\n")

# Comparing voiced vs voiceless
b_features = phoneme_to_features('b')
print("Voice feature comparison:")
print(f"  /p/ voice: {p_features['voice']}")
print(f"  /b/ voice: {b_features['voice']}")

print("\n" + "="*50 + "\n")

# Error handling examples
print("Error handling examples:")

# Default: warn and return None
result = phoneme_to_features('zzz')
print(f"Unknown phoneme with warn: {result}")

# Ignore: silent and return None
result = phoneme_to_features('zzz', on_error='ignore')
print(f"Unknown phoneme with ignore: {result}")

# Raise: will throw exception (commented out)
# result = phoneme_to_features('zzz', on_error='raise')  # Would raise ValueError
```

### features_to_phoneme()

Find the phoneme that best matches a given feature vector.

**Signature:**
```python
features_to_phoneme(
    features: Dict[str, int],
    system: Optional[str] = None,
    threshold: float = 1.0
) -> Optional[str]
```

**Parameters:**
- `features` (Dict[str, int]): Feature dictionary to match against
- `system` (Optional[str]): Feature system to use (None for default)
- `threshold` (float): Minimum similarity threshold (0.0 to 1.0)

**Returns:**
- `str` or `None`: Best matching phoneme, or None if no match above threshold

**Algorithm:**
The function scores each phoneme in the system by the fraction of features whose values match the input exactly, then returns the highest-scoring phoneme if its score reaches `threshold`. A score of 1.0 means a perfect match; a score of 0.0 means no features match.
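A minimal sketch of this scoring scheme follows. It makes two assumptions the text above does not settle: the fraction is taken over the features present in the query, and ties go to the first phoneme encountered. `similarity`, `best_match`, and the `toy` inventory are hypothetical names used only for illustration.

```python
from typing import Dict, Optional


def similarity(query: Dict[str, int], candidate: Dict[str, int]) -> float:
    """Fraction of query features whose values match the candidate exactly."""
    if not query:
        return 0.0
    matches = sum(1 for name, value in query.items() if candidate.get(name) == value)
    return matches / len(query)


def best_match(
    query: Dict[str, int],
    inventory: Dict[str, Dict[str, int]],
    threshold: float = 1.0,
) -> Optional[str]:
    """Return the highest-scoring phoneme, or None if it falls below threshold."""
    best_symbol, best_score = None, -1.0
    for symbol, features in inventory.items():
        score = similarity(query, features)
        if score > best_score:  # strict '>' keeps the first of any tied phonemes
            best_symbol, best_score = symbol, score
    return best_symbol if best_score >= threshold else None


# Hypothetical three-phoneme inventory.
toy = {
    "p": {"voice": 0, "labial": 1, "nasal": 0},
    "b": {"voice": 1, "labial": 1, "nasal": 0},
    "m": {"voice": 1, "labial": 1, "nasal": 1},
}

print(best_match({"voice": 1, "nasal": 1}, toy))                 # m
print(best_match({"voice": 0, "nasal": 1}, toy))                 # None
print(best_match({"voice": 0, "nasal": 1}, toy, threshold=0.5))  # p
```

The last two calls show the role of `threshold`: a voiceless nasal does not exist in the toy inventory, so the best score is 0.5, which is rejected at the default threshold and accepted once the threshold is lowered.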

```python
# Examples of features_to_phoneme()
from distfeat import features_to_phoneme, phoneme_to_features

# Exact match - get features for 'p' and convert back
p_features = phoneme_to_features('p')
recovered = features_to_phoneme(p_features)
print(f"Original: p, Recovered: {recovered}")
print(f"Match quality: {'Perfect' if recovered == 'p' else 'Feature equivalent'}")

print("\n" + "="*50 + "\n")

# Partial match with lower threshold
partial_features = {'voice': 0, 'labial': 1, 'consonantal': 1}
matches = []
for threshold in [1.0, 0.8, 0.6, 0.4]:
    match = features_to_phoneme(partial_features, threshold=threshold)
    matches.append(f"threshold {threshold}: {match}")

print("Partial feature matching (voice=0, labial=1, consonantal=1):")
for match in matches:
    print(f"  {match}")

print("\n" + "="*50 + "\n")

# Impossible feature combination
impossible = {'voice': 2, 'labial': 3}  # Invalid values
result = features_to_phoneme(impossible)
print(f"Impossible features result: {result}")
```

### get_feature_system()

Access the complete phonetic feature system with filtering and format options.

**Signature:**
```python
get_feature_system(
    system: Optional[str] = None,
    as_matrix: bool = False,
    exclude_clicks: bool = False,
    exclude_tones: bool = False,
    exclude_diacritics: bool = False
) -> Union[Dict[str, Dict], np.ndarray]
```

**Parameters:**
- `system` (Optional[str]): Feature system name (None for default CLTS)
- `as_matrix` (bool): Return as numpy matrix instead of dictionary
- `exclude_clicks` (bool): Filter out click consonants (ǀǁǂǃʘ)
- `exclude_tones` (bool): Filter out tonal features and tone marks
- `exclude_diacritics` (bool): Filter out phonemes with diacritical marks

**Returns:**
- `Dict[str, Dict]`: Dictionary mapping phonemes to feature data (default)
- `np.ndarray`: Feature matrix with phonemes as rows, features as columns

**Dictionary Structure:**
Each phoneme maps to a dictionary containing:
- `'features'`: Dict[str, int] - Binary feature vector
- `'name'`: str - Human-readable description
- `'generated'`: bool - Whether phoneme was algorithmically generated

```python
# Examples of get_feature_system()
from distfeat import get_feature_system
import numpy as np

# Basic usage - get full system
system = get_feature_system()
print(f"Total phonemes in system: {len(system)}")

# Show structure for one phoneme
phoneme = 'p'
if phoneme in system:
    data = system[phoneme]
    print(f"\nStructure for /{phoneme}/:")
    print(f"  Name: {data['name']}")
    print(f"  Generated: {data['generated']}")
    print(f"  Number of features: {len(data['features'])}")
    print(f"  Positive features: {sum(data['features'].values())}")

print("\n" + "="*50 + "\n")

# Filtering examples
print("Filtering examples:")
full_count = len(get_feature_system())
no_clicks = len(get_feature_system(exclude_clicks=True))
no_tones = len(get_feature_system(exclude_tones=True))
no_diacritics = len(get_feature_system(exclude_diacritics=True))

print(f"  Full system: {full_count} phonemes")
print(f"  Without clicks: {no_clicks} phonemes ({full_count - no_clicks} clicks filtered)")
print(f"  Without tones: {no_tones} phonemes ({full_count - no_tones} tonal filtered)")
print(f"  Without diacritics: {no_diacritics} phonemes ({full_count - no_diacritics} diacritics filtered)")

print("\n" + "="*50 + "\n")

# Matrix format
matrix = get_feature_system(as_matrix=True)
print(f"Matrix format: {matrix.shape} (phonemes × features)")
print(f"Matrix dtype: {matrix.dtype}")
print(f"Feature density: {np.mean(matrix):.3f} (fraction of positive features)")

# Show a few rows of the matrix
print("\nFirst 5 phonemes, first 10 features:")
print(matrix[:5, :10].astype(int))
```

### get_feature_names()

Get the ordered list of feature names in the system.

**Signature:**
```python
get_feature_names(system: Optional[str] = None) -> List[str]
```

**Parameters:**
- `system` (Optional[str]): Feature system name (None for default)

**Returns:**
- `List[str]`: Ordered list of feature names

**Usage:**
This function is essential for:
- Understanding what features are available
- Converting between dictionary and array representations
- Ensuring consistent feature ordering across operations
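The dictionary/array conversion mentioned above depends only on having a fixed feature order. The sketch below uses a hard-coded, hypothetical `names` list in place of the result of `get_feature_names()`; `to_vector` and `to_dict` are illustrative helpers, not part of distfeat.

```python
from typing import Dict, List


def to_vector(features: Dict[str, int], names: List[str]) -> List[int]:
    """Lay out feature values in the order given by names."""
    return [features[name] for name in names]


def to_dict(vector: List[int], names: List[str]) -> Dict[str, int]:
    """Rebuild a feature dictionary from an ordered vector."""
    if len(vector) != len(names):
        raise ValueError("vector length does not match number of feature names")
    return dict(zip(names, vector))


# Hypothetical ordering; a real one would come from get_feature_names().
names = ["consonantal", "voice", "labial"]
p = {"labial": 1, "voice": 0, "consonantal": 1}

vec = to_vector(p, names)
print(vec)                  # [1, 0, 1]
print(to_dict(vec, names))  # {'consonantal': 1, 'voice': 0, 'labial': 1}
```

Because the order comes from `names` rather than from the dictionary, two vectors built with the same list are comparable position by position regardless of how each dictionary was constructed.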

```python
# Examples of get_feature_names()
from distfeat import get_feature_names, phoneme_to_features

# Get all feature names
feature_names = get_feature_names()
print(f"Total features: {len(feature_names)}")

# Show first 20 features
print("\nFirst 20 features:")
for i, name in enumerate(feature_names[:20]):
    print(f"  {i+1:2d}. {name}")

print("\n" + "="*50 + "\n")

# Feature categories (approximate grouping by name patterns)
categories = {
    'manner': [f for f in feature_names if any(x in f for x in ['stop', 'fricative', 'nasal', 'liquid', 'glide'])],
    'place': [f for f in feature_names if any(x in f for x in ['labial', 'coronal', 'dorsal', 'radical'])],
    'voice': [f for f in feature_names if 'voice' in f],
    'vowel': [f for f in feature_names if any(x in f for x in ['high', 'low', 'back', 'front', 'round'])]
}

print("Feature categories (approximate):")
for category, features in categories.items():
    if features:
        print(f"  {category.capitalize()}: {len(features)} features")
        if len(features) <= 5:
            print(f"    {', '.join(features)}")
        else:
            print(f"    {', '.join(features[:3])}... (+{len(features)-3} more)")

# Verify feature consistency with phoneme data
p_features = phoneme_to_features('p')
print("\nFeature consistency check:")
print(f"  Feature names count: {len(feature_names)}")
print(f"  Phoneme feature count: {len(p_features)}")
print(f"  Sets match: {set(feature_names) == set(p_features.keys())}")
```

### load_custom_features()

Load a custom feature system from a file and register it for use.

**Signature:**
```python
load_custom_features(
    path: Union[str, Path],
    name: str,
    delimiter: str = '\t',
    phoneme_col: str = 'phoneme',
    exclude_cols: Optional[List[str]] = None
) -> None
```

**Parameters:**
- `path` (Union[str, Path]): Path to CSV/TSV file containing feature data
- `name` (str): Name to register the system under
- `delimiter` (str): Column delimiter ('\t' for TSV, ',' for CSV)
- `phoneme_col` (str): Name of column containing phoneme symbols
- `exclude_cols` (Optional[List[str]]): Columns to exclude from features

**File Format:**
The input file should be a delimited text file with:
- One row per phoneme
- One column with phoneme symbols
- Additional columns for each feature
- Feature values: binary (0/1) or ternary (-1/0/1)
- Empty values treated as 0

**Default Exclusions:**
- `phoneme_col` (phoneme symbol column)
- `'description'`, `'name'`, `'note'` (metadata columns)

**Raises:**
- `FileNotFoundError`: If the specified file doesn't exist
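To make the file format concrete, here is a sketch of a parser that follows the rules listed above, reading from an in-memory string instead of a file. It is not the library's loader. `parse_feature_table` is a hypothetical name, and the sketch assumes that a caller-supplied `exclude_cols` replaces the default metadata exclusions while the phoneme column is always excluded.

```python
import csv
import io
from typing import Dict, List, Optional

DEFAULT_EXCLUDED = ("description", "name", "note")


def parse_feature_table(
    text: str,
    delimiter: str = "\t",
    phoneme_col: str = "phoneme",
    exclude_cols: Optional[List[str]] = None,
) -> Dict[str, Dict[str, int]]:
    """Parse delimited text into {phoneme: {feature: value}}."""
    # Assumption: exclude_cols, when given, replaces the default metadata list.
    excluded = set(DEFAULT_EXCLUDED if exclude_cols is None else exclude_cols)
    excluded.add(phoneme_col)

    table: Dict[str, Dict[str, int]] = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        table[row[phoneme_col]] = {
            column: int(value) if (value or "").strip() else 0  # empty cell -> 0
            for column, value in row.items()
            if column not in excluded
        }
    return table


sample = (
    "phoneme\tname\tvoice\tlabial\n"
    "p\tvoiceless bilabial stop\t\t1\n"
    "b\tvoiced bilabial stop\t1\t1\n"
)
table = parse_feature_table(sample)
print(table)  # {'p': {'voice': 0, 'labial': 1}, 'b': {'voice': 1, 'labial': 1}}
```

The empty `voice` cell for `p` becomes 0, and the `name` column is dropped as metadata. Ternary values such as -1 pass through this sketch unchanged.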

```python
# Example of load_custom_features()
from distfeat import (
    load_custom_features,
    phoneme_to_features,
    get_feature_system,
    get_feature_names,
)
import tempfile
import os

# Create a simple custom feature system for demonstration
custom_data = '''phoneme,consonantal,vocalic,voice,labial
p,1,0,0,1
b,1,0,1,1
t,1,0,0,0
d,1,0,1,0
a,0,1,1,0
i,0,1,1,0'''

# Write to temporary file
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
    f.write(custom_data)
    temp_path = f.name

try:
    # Load the custom system
    load_custom_features(
        path=temp_path,
        name='simple_demo',
        delimiter=',',
        phoneme_col='phoneme'
    )

    print("Custom feature system loaded successfully!")

    # Test using the custom system
    p_features = phoneme_to_features('p', system='simple_demo')
    print(f"\nFeatures for /p/ in custom system: {p_features}")

    # Get the custom system
    custom_system = get_feature_system(system='simple_demo')
    print(f"\nCustom system phonemes: {list(custom_system.keys())}")

    # Show feature names in custom system
    custom_features = get_feature_names(system='simple_demo')
    print(f"Custom feature names: {custom_features}")

finally:
    # Clean up temporary file
    os.unlink(temp_path)

print("\n" + "="*50 + "\n")
print("Custom system examples:")
print("- Linguistic research with specialized feature sets")
print("- Comparison with different theoretical frameworks")
print("- Domain-specific applications (e.g., speech therapy)")
print("- Experimental feature systems")
```

## Advanced Usage

### Feature System Comparison

Compare different feature systems or theoretical approaches:

```python
# Advanced: Feature system comparison
from distfeat import phoneme_to_features, get_feature_names

# Compare features across different phonemes
phonemes = ['p', 'b', 't', 'd', 'k', 'g']
comparison_features = ['voice', 'labial', 'coronal', 'dorsal']

print("Stop consonant feature comparison:")
print(f"{'Phoneme':<8} {' '.join(f'{f:<8}' for f in comparison_features)}")
print("-" * (8 + len(comparison_features) * 9))

for phoneme in phonemes:
    features = phoneme_to_features(phoneme)
    if features:
        values = [str(features.get(f, 0)) for f in comparison_features]
        print(f"{phoneme:<8} {' '.join(f'{v:<8}' for v in values)}")
```

### Feature-Based Phoneme Search

Find phonemes that match specific feature combinations:

```python
# Advanced: Find phonemes by feature criteria
from distfeat import get_feature_system


def find_phonemes_by_features(feature_criteria, max_results=10):
    """
    Find phonemes matching specific feature criteria.

    Args:
        feature_criteria: Dict of feature_name: required_value
        max_results: Maximum number of results to return

    Returns:
        List of matching phonemes
    """
    system = get_feature_system()
    matches = []

    for phoneme, data in system.items():
        features = data['features']

        # Check if all criteria are met (a missing feature counts as 0)
        if all(features.get(feat, 0) == val for feat, val in feature_criteria.items()):
            matches.append(phoneme)

        if len(matches) >= max_results:
            break

    return matches

# Find voiced stops
voiced_stops = find_phonemes_by_features({
    'voice': 1,
    'consonantal': 1,
    'continuant': 0  # stops are non-continuant
})
print(f"Voiced stops: {voiced_stops[:10]}")

# Find high vowels
high_vowels = find_phonemes_by_features({
    'high': 1,
    'consonantal': 0,
    'vocalic': 1
})
print(f"High vowels: {high_vowels[:10]}")

# Find fricatives
fricatives = find_phonemes_by_features({
    'consonantal': 1,
    'continuant': 1,
    'strident': 1
})
print(f"Fricatives: {fricatives[:10]}")
```

### Performance and Caching

The features module uses several optimization strategies:

1. **Global Caching**: Feature system loaded once and cached globally
2. **LRU Cache**: `phoneme_to_features()` uses `@lru_cache(maxsize=1024)` for fast repeated lookups
3. **Lazy Loading**: Feature system only loaded when first accessed
4. **Memory Efficiency**: Binary features stored as integers, not booleans
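The first three strategies can be combined in a few lines. The sketch below is an illustration under stated assumptions, not the module's code: `_load_system`, `_get_system`, `_cached_items`, and `cached_lookup` are hypothetical names, and `LOAD_COUNT` exists only to make the single load observable. Results are cached as tuples so that callers receive a fresh dictionary each time.

```python
from functools import lru_cache
from typing import Dict, Optional, Tuple

_SYSTEM: Optional[Dict[str, Dict[str, int]]] = None
LOAD_COUNT = 0  # demo instrumentation only


def _load_system() -> Dict[str, Dict[str, int]]:
    """Stand-in for reading the bundled feature data from disk."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"p": {"voice": 0, "labial": 1}, "b": {"voice": 1, "labial": 1}}


def _get_system() -> Dict[str, Dict[str, int]]:
    """Lazy loading plus global caching: load on first access, then reuse."""
    global _SYSTEM
    if _SYSTEM is None:
        _SYSTEM = _load_system()
    return _SYSTEM


@lru_cache(maxsize=1024)
def _cached_items(phoneme: str) -> Optional[Tuple[Tuple[str, int], ...]]:
    """Cache an immutable snapshot of the features for each phoneme."""
    features = _get_system().get(phoneme)
    return None if features is None else tuple(features.items())


def cached_lookup(phoneme: str) -> Optional[Dict[str, int]]:
    """Return a fresh dict so callers cannot mutate the cached entry."""
    items = _cached_items(phoneme)
    return None if items is None else dict(items)


for _ in range(3):
    cached_lookup("p")

info = _cached_items.cache_info()
print(LOAD_COUNT, info.hits, info.misses)  # 1 2 1
```

Three lookups of the same phoneme trigger one load and one cache miss; the remaining two are served from the LRU cache.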

**Cache Statistics:**

```python
# Performance and caching examples
import time
from distfeat import phoneme_to_features

test_phonemes = ['p', 'b', 't', 'd', 'k', 'g', 'a', 'e', 'i', 'o', 'u']

# Clear cache and measure cold start
phoneme_to_features.cache_clear()

# Measure warm-up time (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
for phoneme in test_phonemes:
    phoneme_to_features(phoneme)
warmup_time = time.perf_counter() - start

# Measure cached access time
start = time.perf_counter()
for _ in range(100):
    for phoneme in test_phonemes:
        phoneme_to_features(phoneme)
cached_time = time.perf_counter() - start

# Cache statistics
cache_info = phoneme_to_features.cache_info()

print("Performance metrics:")
print(f"  Cold start (11 phonemes): {warmup_time:.6f} seconds")
print(f"  Cached access (1100 lookups): {cached_time:.6f} seconds")
# Per-lookup speedup: the cached loop performs 100x as many lookups
print(f"  Speedup factor: {warmup_time / cached_time * 100:.1f}x")
print("\nCache statistics:")
print(f"  Hits: {cache_info.hits}")
print(f"  Misses: {cache_info.misses}")
print(f"  Cache size: {cache_info.currsize}/{cache_info.maxsize}")
print(f"  Hit rate: {cache_info.hits / (cache_info.hits + cache_info.misses) * 100:.1f}%")
```

## Error Handling

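The features module signals problems in three ways: a `None` return value (with an optional warning) for unknown phonemes, `ValueError` for unknown phonemes under `on_error='raise'` and for unknown feature systems, and `FileNotFoundError` for missing custom feature files.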

### Common Error Scenarios

```python
# Error handling examples
import warnings
from distfeat import phoneme_to_features, load_custom_features

print("Error handling scenarios:")

# 1. Unknown phoneme with different error modes
print("\n1. Unknown phoneme handling:")

# Warn mode (default)
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    result = phoneme_to_features('xyz')
    print(f"   warn mode: result={result}, warnings={len(w)}")

# Ignore mode
result = phoneme_to_features('xyz', on_error='ignore')
print(f"   ignore mode: result={result}")

# Raise mode
try:
    phoneme_to_features('xyz', on_error='raise')
    print("   raise mode: no error (unexpected!)")
except ValueError as e:
    print(f"   raise mode: ValueError caught - {str(e)[:50]}...")

# 2. Unknown feature system
print("\n2. Unknown feature system:")
try:
    phoneme_to_features('p', system='nonexistent')
except ValueError as e:
    print(f"   ValueError: {e}")

# 3. Missing feature file (simulation)
print("\n3. File handling:")
try:
    load_custom_features('/nonexistent/file.csv', 'test')
except FileNotFoundError as e:
    print(f"   FileNotFoundError: {str(e)[:60]}...")

print("\nError handling best practices:")
print("- Use 'ignore' for bulk processing where missing data is expected")
print("- Use 'warn' (default) for interactive analysis")
print("- Use 'raise' for strict validation in critical applications")
print("- Always validate custom feature files before loading")
```

## Integration Examples

### Working with Distance Functions

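Feature vectors from this module can be checked directly against the results of `calculate_distance()`. The cell below compares a hand-computed normalized Hamming distance with the library's value, then prints distances within and between two natural classes: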

```python
# Integration with distance functions
from distfeat import calculate_distance, phoneme_to_features

# Manual feature comparison
p_features = phoneme_to_features('p')
b_features = phoneme_to_features('b')

# Count different features manually
differences = sum(1 for f in p_features if p_features[f] != b_features[f])
manual_hamming = differences / len(p_features)

# Compare with distance function
auto_hamming = calculate_distance('p', 'b', method='hamming')

print(f"Manual Hamming calculation: {manual_hamming:.3f}")
print(f"Automatic Hamming distance: {auto_hamming:.3f}")
print(f"Match: {abs(manual_hamming - auto_hamming) < 1e-6}")

# Feature-based natural classes
voiceless_stops = ['p', 't', 'k']
voiced_stops = ['b', 'd', 'g']

print("\nNatural class distances:")
print("Within voiceless stops:")
for i, p1 in enumerate(voiceless_stops):
    for p2 in voiceless_stops[i+1:]:
        dist = calculate_distance(p1, p2)
        print(f"  {p1}-{p2}: {dist:.3f}")

print("Between voiceless and voiced:")
for p1, p2 in zip(voiceless_stops, voiced_stops):
    dist = calculate_distance(p1, p2)
    print(f"  {p1}-{p2}: {dist:.3f}")
```

## Implementation Notes

### Data Format

The features module handles the conversion from the CLTS BIPA format to internal binary features:

- **Input**: CSV with phoneme column and feature columns containing -1, 0, 1
- **Internal**: Binary features where positive values (1) become 1, others become 0
- **Rationale**: Binary features simplify distance calculations and reduce memory usage
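The binarization rule is small enough to state as code. In this sketch, `binarize` is a hypothetical helper and the `raw` values are made up for illustration.

```python
from typing import Dict


def binarize(value: int) -> int:
    """Map a ternary feature value (-1, 0, 1) to binary: positive -> 1, else 0."""
    return 1 if value > 0 else 0


# Made-up ternary values for a single phoneme.
raw: Dict[str, int] = {"voice": -1, "labial": 1, "nasal": 0}
binary = {name: binarize(value) for name, value in raw.items()}
print(binary)  # {'voice': 0, 'labial': 1, 'nasal': 0}
```

One consequence of this mapping is that -1 and 0 both become 0, so the binary form no longer distinguishes a negative feature value from a zero one.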

### Thread Safety

The module is designed to be thread-safe:

- Global cache initialized with locks
- Immutable feature data after initialization
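- `functools.lru_cache` (available since Python 3.2) keeps its internal state consistent under concurrent calls, although the wrapped function may run more than once for the same key if two threads miss at the same time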
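One common way to initialize a global cache under a lock is double-checked locking, sketched below. This illustrates the general pattern and is not taken from the module's source; `_lock`, `_system`, `_load`, and `get_system` are hypothetical names, and `load_calls` is demo instrumentation.

```python
import threading
from typing import Dict, Optional

_lock = threading.Lock()
_system: Optional[Dict[str, Dict[str, int]]] = None
load_calls = 0  # demo instrumentation only


def _load() -> Dict[str, Dict[str, int]]:
    """Stand-in for the expensive one-time load."""
    global load_calls
    load_calls += 1
    return {"p": {"voice": 0}}


def get_system() -> Dict[str, Dict[str, int]]:
    """Initialize the shared system at most once, even with concurrent callers."""
    global _system
    if _system is None:          # fast path: no lock once initialized
        with _lock:
            if _system is None:  # re-check: another thread may have loaded it
                _system = _load()
    return _system


threads = [threading.Thread(target=get_system) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(load_calls)  # 1
```

The second `None` check inside the lock is what guarantees a single load: a thread that waited for the lock sees the value set by the thread that held it.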

### Memory Management

- Feature system loaded once and shared
- Feature dictionaries are copied on return to prevent mutation
- LRU cache prevents unbounded memory growth from repeated queries

### Extension Points

The module provides several extension points for customization:

1. **Custom feature systems** via `load_custom_features()`
2. **Alternative data sources** by modifying `_load_bundled_features()`
3. **Feature filtering** via filtering parameters in `get_feature_system()`
4. **Feature transformations** by preprocessing data before loading

## See Also

- **[Distance Module API](distances.ipynb)**: Calculate phonetic distances using features
- **[Implementation Details](../chapters/03_implementation.ipynb)**: Internal architecture and algorithms
- **[Validation](../chapters/04_validation.ipynb)**: Testing and validation of feature systems
- **[User Guide](../chapters/01_quickstart.ipynb)**: Getting started with distfeat

## References

- **CLTS**: Cross-Linguistic Transcription Systems - source of phonetic feature data
- **BIPA**: Broad IPA - standardized IPA symbol inventory
- **Phonological Features**: Chomsky & Halle (1968), Hayes (2009), and modern feature theory