# Focused Learning: Morphological Similarity Exploitation in T-FREE

## Learning Objectives
1. **Understand how T-FREE exploits morphological similarities through shared trigrams**
2. **Explore the advantages of character-level representations for morphology**
3. **Implement morphological analysis using trigram-based representations**
4. **Demonstrate how T-FREE handles word variations without explicit morphological rules**

## Paper Context

From the T-FREE paper (Deiseroth et al., 2025):

> "T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers" (Abstract)

> "T-FREE explicitly models character overlaps between morphologically similar words without the need to learn an embedding for each variant from scratch through a one-to-one bijection" (Section 1)

> "We directly embed each word in the input text with sparse activation patterns over hashed character triplets" (Section 1)

The key innovation is that **morphologically related words naturally share trigrams**, creating implicit morphological understanding without explicit rules or separate embeddings for each variant.
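
To make "sparse activation patterns over hashed character triplets" concrete, here is a minimal sketch of how a word could be turned into a set of active embedding rows. The hash function (salted SHA-256), the number of hashes per trigram (`num_hashes`), and the `_` padding are illustrative assumptions, not the paper's exact configuration; `vocab_size=8000` simply mirrors the default used in the classes of this notebook.

```python
import hashlib


def trigram_indices(trigram, vocab_size=8000, num_hashes=4):
    """Map one trigram to `num_hashes` row indices (assumed hashing scheme)."""
    indices = []
    for salt in range(num_hashes):
        digest = hashlib.sha256(f"{salt}:{trigram}".encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer, folded into the table size
        indices.append(int.from_bytes(digest[:8], "big") % vocab_size)
    return indices


def word_activation(word, vocab_size=8000, num_hashes=4):
    """Return the set of active row indices for a word (its sparse pattern)."""
    padded = f"_{word}_"
    active = set()
    for i in range(len(padded) - 2):
        active.update(trigram_indices(padded[i:i + 3], vocab_size, num_hashes))
    return active


compute = word_activation("compute")
computing = word_activation("computing")
shared = compute & computing
print(f"compute: {len(compute)} active rows, computing: {len(computing)} active rows")
print(f"rows shared by both words: {len(shared)}")
```

Because the hashing is deterministic, every trigram the two words have in common (such as `com` or `put`) activates the same rows in both patterns, which is where the parameter sharing comes from. Unrelated trigrams can also collide on a row by chance, which is the price of a fixed-size table.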

## 1. Theoretical Foundation

### Traditional Tokenizer Limitations with Morphology

Traditional subword tokenizers face challenges with morphological variations:

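1. **Independent Embeddings**: Each token has its own embedding, so word forms that map to different tokens start from unrelated representations
2. **Vocabulary Explosion**: Covering frequent morphological variants as single tokens consumes vocabulary entries
3. **No Systematic Sharing**: Related forms share parameters only when they happen to share a token
4. **Poor Generalization**: Unseen forms are split into subword pieces that need not align with morpheme boundaries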

### T-FREE's Morphological Advantages

T-FREE addresses these through trigram sharing:

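1. **Automatic Parameter Sharing**: Words with a common stem share that stem's trigrams
2. **Compositional Representations**: Each morpheme contributes its own trigrams to the word's representation
3. **Zero-Shot Morphology**: An unseen form reuses trigrams already learned from related words
4. **Cross-Lingual Morphology**: Affixes spelled alike in different languages map to the same trigrams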

In [None]:
# Environment setup
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Set, Tuple
from collections import defaultdict, Counter
import networkx as nx
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

## 2. Morphological Analysis with Trigrams

Let's analyze how trigrams capture morphological patterns:

In [None]:
class MorphologicalAnalyzer:
    """Analyze morphological patterns using trigram representations."""
    
    def __init__(self, vocab_size: int = 8000):
        self.vocab_size = vocab_size
        self.trigram_cache = {}
        
    def extract_trigrams(self, word: str) -> List[str]:
        """Extract trigrams from a word."""
        if word in self.trigram_cache:
            return self.trigram_cache[word]
            
        padded_word = f"_{word}_"
        trigrams = [padded_word[i:i+3] for i in range(len(padded_word) - 2)]
        self.trigram_cache[word] = trigrams
        return trigrams
    
    def analyze_morphological_family(self, word_family: List[str]) -> Dict[str, object]:
        """Analyze trigram patterns in morphologically related words."""
        analysis = {
            'words': word_family,
            'trigram_sets': {},
            'shared_trigrams': set(),  # stays empty if word_family is empty
            'unique_trigrams': {},
            'similarity_matrix': None
        }
        
        # Extract trigrams for each word
        for word in word_family:
            trigrams = self.extract_trigrams(word)
            analysis['trigram_sets'][word] = set(trigrams)
        
        # Find shared trigrams (likely the stem)
        all_trigram_sets = list(analysis['trigram_sets'].values())
        if all_trigram_sets:
            analysis['shared_trigrams'] = set.intersection(*all_trigram_sets)
        
        # Find unique trigrams for each word (likely affixes)
        for word, trigram_set in analysis['trigram_sets'].items():
            unique = trigram_set - analysis['shared_trigrams']
            analysis['unique_trigrams'][word] = unique
        
        # Calculate similarity matrix
        n_words = len(word_family)
        similarity_matrix = np.zeros((n_words, n_words))
        
        for i, word1 in enumerate(word_family):
            for j, word2 in enumerate(word_family):
                set1 = analysis['trigram_sets'][word1]
                set2 = analysis['trigram_sets'][word2]
                if set1 | set2:
                    similarity = len(set1 & set2) / len(set1 | set2)
                else:
                    similarity = 0
                similarity_matrix[i, j] = similarity
        
        analysis['similarity_matrix'] = similarity_matrix
        return analysis
    
    def visualize_morphological_family(self, word_family: List[str], 
                                     family_name: str = "Word Family"):
        """Visualize morphological relationships."""
        analysis = self.analyze_morphological_family(word_family)
        
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
        
        # 1. Similarity heatmap
        sns.heatmap(analysis['similarity_matrix'], annot=True, fmt='.2f', 
                   cmap='YlOrRd', ax=ax1,
                   xticklabels=word_family, yticklabels=word_family)
        ax1.set_title(f'Trigram Similarity Matrix: {family_name}', fontsize=14)
        
        # 2. Shared vs unique trigrams
        words = analysis['words']
        shared_counts = [len(analysis['shared_trigrams'])] * len(words)
        unique_counts = [len(analysis['unique_trigrams'][w]) for w in words]
        
        x = np.arange(len(words))
        width = 0.35
        
        ax2.bar(x - width/2, shared_counts, width, label='Shared Trigrams', 
                color='skyblue', alpha=0.8)
        ax2.bar(x + width/2, unique_counts, width, label='Unique Trigrams', 
                color='coral', alpha=0.8)
        
        ax2.set_xlabel('Word')
        ax2.set_ylabel('Number of Trigrams')
        ax2.set_title(f'Trigram Distribution: {family_name}', fontsize=14)
        ax2.set_xticks(x)
        ax2.set_xticklabels(words, rotation=45, ha='right')
        ax2.legend()
        ax2.grid(axis='y', alpha=0.3)
        
        # 3. Trigram overlap visualization
        G = nx.Graph()
        
        # Add nodes for words
        for word in words:
            G.add_node(word, node_type='word')
        
        # Add edges based on shared trigrams
        for i, word1 in enumerate(words):
            for j, word2 in enumerate(words):
                if i < j:
                    shared = len(analysis['trigram_sets'][word1] & 
                               analysis['trigram_sets'][word2])
                    if shared > 0:
                        G.add_edge(word1, word2, weight=shared)
        
        # Draw network
        pos = nx.spring_layout(G, k=2, iterations=50)
        
        # Draw edges with width based on shared trigrams
        edges = G.edges()
        weights = [G[u][v]['weight'] for u, v in edges]
        
        nx.draw_networkx_nodes(G, pos, node_color='lightblue', 
                              node_size=3000, ax=ax3)
        nx.draw_networkx_labels(G, pos, font_size=10, ax=ax3)
        nx.draw_networkx_edges(G, pos, width=[w*0.5 for w in weights], 
                              alpha=0.6, ax=ax3)
        
        # Add edge labels
        edge_labels = nx.get_edge_attributes(G, 'weight')
        nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=8, ax=ax3)
        
        ax3.set_title(f'Morphological Network: {family_name}', fontsize=14)
        ax3.axis('off')
        
        # 4. Trigram details
        ax4.text(0.1, 0.9, f"Shared Trigrams ({len(analysis['shared_trigrams'])}):", 
                transform=ax4.transAxes, fontsize=12, fontweight='bold')
        
        shared_list = sorted(list(analysis['shared_trigrams']))
        ax4.text(0.1, 0.8, ', '.join(shared_list[:10]), 
                transform=ax4.transAxes, fontsize=10, wrap=True)
        
        if len(shared_list) > 10:
            ax4.text(0.1, 0.75, f"... and {len(shared_list) - 10} more", 
                    transform=ax4.transAxes, fontsize=10, style='italic')
        
        # Show unique trigrams for each word
        y_pos = 0.6
        for word in words[:3]:  # Show first 3 words
            unique = sorted(list(analysis['unique_trigrams'][word]))
            if unique:
                ax4.text(0.1, y_pos, f"{word}: {', '.join(unique[:5])}", 
                        transform=ax4.transAxes, fontsize=10)
                y_pos -= 0.1
        
        ax4.set_title('Trigram Analysis Details', fontsize=14)
        ax4.axis('off')
        
        plt.tight_layout()
        plt.show()
        
        return analysis


# Analyze morphological families
analyzer = MorphologicalAnalyzer()

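# Example morphological families (labels are approximate: e.g. 'computer' and
# 'worker' are derived nouns, and 'happier' is an inflected comparative)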
families = [
    ('Verbal Inflection', ['compute', 'computes', 'computed', 'computing', 'computer']),
    ('Derivational Morphology', ['happy', 'unhappy', 'happiness', 'happily', 'happier']),
    ('Compound Morphology', ['work', 'worker', 'workshop', 'workload', 'workplace'])
]

for family_name, word_family in families:
    print(f"\nAnalyzing: {family_name}")
    print("=" * 50)
    analysis = analyzer.visualize_morphological_family(word_family, family_name)
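
The similarity matrix above is the Jaccard index of two trigram sets: the number of shared trigrams divided by the number of distinct trigrams in either word. As a worked example, here is the computation for `compute` and `computes` written out by hand. The helper names `trigram_set` and `jaccard` are introduced only for this sketch.

```python
def trigram_set(word):
    """Distinct trigrams of a word padded with '_' on both sides."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


compute = trigram_set("compute")    # 7 trigrams
computes = trigram_set("computes")  # 8 trigrams
shared = compute & computes

print(sorted(shared))                         # ['_co', 'com', 'mpu', 'omp', 'put', 'ute']
print(round(jaccard(compute, computes), 3))   # 6 shared / 9 distinct = 0.667
```

Only the word-final trigrams differ (`te_` versus `tes` and `es_`), so a single added character lowers the score from 1.0 to about 0.67. Short words are therefore penalized more heavily by an affix than long ones.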

## 3. Morphological Productivity

Let's explore how T-FREE handles morphological productivity (creating new words):

In [None]:
class MorphologicalProductivity:
    """Analyze morphological productivity with T-FREE."""
    
    def __init__(self, vocab_size: int = 8000, embed_dim: int = 128):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.trigram_embeddings = {}
        
    def initialize_trigram_embeddings(self, known_words: List[str]):
        """Initialize embeddings based on known words."""
        # Extract all trigrams
        all_trigrams = set()
        for word in known_words:
            padded = f"_{word}_"
            for i in range(len(padded) - 2):
                all_trigrams.add(padded[i:i+3])
        
        # Assign random embeddings
        for trigram in all_trigrams:
            self.trigram_embeddings[trigram] = np.random.randn(self.embed_dim) * 0.1
    
    def get_word_embedding(self, word: str) -> np.ndarray:
        """Get T-FREE embedding for a word."""
        padded = f"_{word}_"
        embedding = np.zeros(self.embed_dim)
        
        for i in range(len(padded) - 2):
            trigram = padded[i:i+3]
            if trigram in self.trigram_embeddings:
                embedding += self.trigram_embeddings[trigram]
            else:
                # New trigram - assign random embedding
                self.trigram_embeddings[trigram] = np.random.randn(self.embed_dim) * 0.1
                embedding += self.trigram_embeddings[trigram]
        
        # Normalize
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm
            
        return embedding
    
    def analyze_novel_word_formation(self):
        """Analyze how T-FREE handles novel word formations."""
        # Base words
        base_words = ['compute', 'program', 'data', 'learn', 'model']
        
        # Initialize embeddings
        self.initialize_trigram_embeddings(base_words)
        
        # Create novel words through morphological processes
        novel_formations = {
            'Prefixation': [
                ('compute', 'recompute'),
                ('program', 'reprogram'),
                ('model', 'premodel')
            ],
            'Suffixation': [
                ('compute', 'computable'),
                ('program', 'programmable'),
                ('data', 'dataless')
            ],
            'Compounding': [
                ('data+model', 'datamodel'),
                ('learn+model', 'learnmodel'),
                ('compute+program', 'computeprogram')
            ],
            'Blending': [
                ('compute+automate', 'computate'),
                ('program+grammar', 'programmar'),
                ('data+database', 'datbase')
            ]
        }
        
        # Analyze each formation type
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))
        axes = axes.flatten()
        
        for idx, (formation_type, word_pairs) in enumerate(novel_formations.items()):
            similarities = []
            
            for base, novel in word_pairs:
                # Handle compound notation
                if '+' in base:
                    parts = base.split('+')
                    base_embedding = np.mean([self.get_word_embedding(p) 
                                            for p in parts], axis=0)
                else:
                    base_embedding = self.get_word_embedding(base)
                
                novel_embedding = self.get_word_embedding(novel)
                
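                # Cosine similarity; divide by both norms because the mean of
                # unit vectors (compound case) is not itself unit length
                denom = np.linalg.norm(base_embedding) * np.linalg.norm(novel_embedding)
                similarity = float(np.dot(base_embedding, novel_embedding) / denom) if denom > 0 else 0.0
                similarities.append(similarity)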
            
            # Visualization
            ax = axes[idx]
            x = np.arange(len(word_pairs))
            
            bars = ax.bar(x, similarities, color='steelblue', alpha=0.8)
            ax.set_ylim(0, 1)
            ax.set_xlabel('Word Pairs', fontsize=12)
            ax.set_ylabel('Cosine Similarity', fontsize=12)
            ax.set_title(f'{formation_type} Formation', fontsize=14)
            
            # Add labels
            labels = [f"{base}\n→\n{novel}" for base, novel in word_pairs]
            ax.set_xticks(x)
            ax.set_xticklabels(labels, rotation=0, ha='center', fontsize=10)
            
            # Add value labels
            for bar, sim in zip(bars, similarities):
                ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                       f'{sim:.3f}', ha='center', va='bottom', fontsize=10)
            
            ax.grid(axis='y', alpha=0.3)
        
        plt.suptitle('Morphological Productivity: Base vs Novel Word Similarities', 
                    fontsize=16)
        plt.tight_layout()
        plt.show()
        
        return novel_formations


# Analyze morphological productivity
productivity = MorphologicalProductivity()
formations = productivity.analyze_novel_word_formation()

print("\nKey Insights:")
print("1. Novel words that reuse a base form's trigrams stay close to it in embedding space")
print("2. Similarity here reflects shared spelling, not learned semantics (the embeddings are random)")
print("3. T-FREE represents unseen words by composing trigram embeddings")
print("4. No explicit morphological rules are needed to build these representations")
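
The similarities plotted above can be predicted from trigram counts alone. If every trigram gets an independent random vector, vectors of different trigrams are nearly orthogonal in high dimensions, so the cosine between two summed embeddings is approximately the number of shared trigrams divided by the geometric mean of the two trigram counts. The sketch below checks this with the standard library only. It assumes each distinct trigram is counted once per word, and `dim=4096` and the seed are arbitrary choices for illustration.

```python
import math
import random


def trigrams(word):
    """Distinct trigrams of a padded word, in order of first appearance."""
    padded = f"_{word}_"
    return list(dict.fromkeys(padded[i:i + 3] for i in range(len(padded) - 2)))


def predicted_cosine(w1, w2):
    """Approximation: shared / sqrt(n1 * n2) for near-orthogonal trigram vectors."""
    a, b = set(trigrams(w1)), set(trigrams(w2))
    return len(a & b) / math.sqrt(len(a) * len(b))


def empirical_cosine(w1, w2, dim=4096, seed=0):
    """Cosine between sums of random Gaussian trigram vectors."""
    rng = random.Random(seed)
    table = {}

    def embed(word):
        vec = [0.0] * dim
        for t in trigrams(word):
            if t not in table:
                table[t] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for k, x in enumerate(table[t]):
                vec[k] += x
        return vec

    u, v = embed(w1), embed(w2)
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


print(round(predicted_cosine("compute", "recompute"), 3))  # 6 / sqrt(7 * 9) = 0.756
print(round(empirical_cosine("compute", "recompute"), 3))
```

The two numbers should agree closely, which shows that with untrained embeddings the "similarity" is a count of shared spelling fragments. Any semantic structure has to come from training the trigram embeddings.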

## 4. Cross-Lingual Morphological Transfer

Let's explore how affixes that are spelled alike in different languages end up sharing trigrams, which is what lets T-FREE transfer morphological patterns across languages:

In [None]:
class CrossLingualMorphology:
    """Analyze cross-lingual morphological patterns."""
    
    def __init__(self):
        self.languages = {
            'English': {
                'suffixes': ['-ing', '-ed', '-er', '-est', '-ly', '-ness'],
                'prefixes': ['un-', 're-', 'pre-', 'dis-', 'over-'],
                'examples': {
                    'work': ['working', 'worked', 'worker', 'workable'],
                    'happy': ['unhappy', 'happily', 'happiness', 'happier']
                }
            },
            'Spanish': {
                'suffixes': ['-ando', '-ido', '-ador', '-mente', '-ción'],
                'prefixes': ['des-', 're-', 'pre-', 'in-', 'sobre-'],
                'examples': {
                    'trabajo': ['trabajando', 'trabajado', 'trabajador', 'trabajable'],
                    'feliz': ['infeliz', 'felizmente', 'felicidad', 'felicísimo']
                }
            },
            'German': {
                'suffixes': ['-end', '-t', '-er', '-ung', '-lich'],
                'prefixes': ['un-', 'ver-', 'vor-', 'über-', 'ent-'],
                'examples': {
                    'arbeit': ['arbeitend', 'gearbeitet', 'Arbeiter', 'bearbeitbar'],
                    'glücklich': ['unglücklich', 'Glück', 'glücklicherweise']
                }
            }
        }
        
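    def extract_morpheme_trigrams(self, morpheme: str) -> Set[str]:
        """Extract trigrams from a morpheme."""
        # Mark only the outer word boundary of an affix: a suffix ends a word
        # and a prefix starts one, so neither gets a marker on its inner side.
        # Very short affixes (e.g. '-t') yield no trigram at all.
        if morpheme.startswith('-'):
            padded = f"{morpheme[1:]}_"
        elif morpheme.endswith('-'):
            padded = f"_{morpheme[:-1]}"
        else:
            padded = f"_{morpheme}_"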
            
        trigrams = set()
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i+3])
        return trigrams
    
    def analyze_cross_lingual_morphemes(self):
        """Analyze morpheme similarities across languages."""
        # Collect all morphemes
        morpheme_trigrams = defaultdict(dict)
        
        for lang, data in self.languages.items():
            for suffix in data['suffixes']:
                morpheme_trigrams[lang][suffix] = self.extract_morpheme_trigrams(suffix)
            for prefix in data['prefixes']:
                morpheme_trigrams[lang][prefix] = self.extract_morpheme_trigrams(prefix)
        
        # Find cross-lingual similarities
        similar_morphemes = []
        
        for lang1 in self.languages:
            for lang2 in self.languages:
                if lang1 < lang2:  # Avoid duplicates
                    for morph1, trigrams1 in morpheme_trigrams[lang1].items():
                        for morph2, trigrams2 in morpheme_trigrams[lang2].items():
                            similarity = len(trigrams1 & trigrams2) / len(trigrams1 | trigrams2) \
                                       if trigrams1 | trigrams2 else 0
                            if similarity > 0.3:  # Threshold for similarity
                                similar_morphemes.append({
                                    'lang1': lang1,
                                    'morph1': morph1,
                                    'lang2': lang2,
                                    'morph2': morph2,
                                    'similarity': similarity
                                })
        
        # Visualize results
        if similar_morphemes:
            similar_morphemes.sort(key=lambda x: x['similarity'], reverse=True)
            
            # Create visualization
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
            
            # Bar chart of top similarities
            top_n = min(10, len(similar_morphemes))
            labels = [f"{m['lang1']}:{m['morph1']}\n{m['lang2']}:{m['morph2']}" 
                     for m in similar_morphemes[:top_n]]
            similarities = [m['similarity'] for m in similar_morphemes[:top_n]]
            
            y_pos = np.arange(len(labels))
            ax1.barh(y_pos, similarities, color='teal', alpha=0.8)
            ax1.set_yticks(y_pos)
            ax1.set_yticklabels(labels, fontsize=10)
            ax1.set_xlabel('Trigram Similarity', fontsize=12)
            ax1.set_title('Cross-Lingual Morpheme Similarities', fontsize=14)
            ax1.grid(axis='x', alpha=0.3)
            
            # Network visualization
            G = nx.Graph()
            
            # Add nodes and edges for the strongest connections, so that every
            # plotted pair has both of its endpoints in the graph
            for m in similar_morphemes[:20]:  # Top 20 connections
                node1 = f"{m['lang1']}:{m['morph1']}"
                node2 = f"{m['lang2']}:{m['morph2']}"
                G.add_node(node1, language=m['lang1'])
                G.add_node(node2, language=m['lang2'])
                G.add_edge(node1, node2, weight=m['similarity'])
            
            # Draw network
            pos = nx.spring_layout(G, k=3, iterations=50)
            
            # Color by language
            color_map = {'English': 'lightblue', 'Spanish': 'lightcoral', 
                        'German': 'lightgreen'}
            node_colors = [color_map[G.nodes[node]['language']] for node in G.nodes()]
            
            nx.draw_networkx_nodes(G, pos, node_color=node_colors, 
                                  node_size=2000, ax=ax2)
            nx.draw_networkx_labels(G, pos, font_size=8, ax=ax2)
            
            # Draw edges with varying thickness
            edges = G.edges()
            weights = [G[u][v]['weight'] * 3 for u, v in edges]
            nx.draw_networkx_edges(G, pos, width=weights, alpha=0.5, ax=ax2)
            
            ax2.set_title('Cross-Lingual Morpheme Network', fontsize=14)
            ax2.axis('off')
            
            # Add legend
            for lang, color in color_map.items():
                ax2.scatter([], [], c=color, s=100, label=lang)
            ax2.legend(loc='upper right')
            
            plt.tight_layout()
            plt.show()
            
        return similar_morphemes


# Analyze cross-lingual morphology
cross_lingual = CrossLingualMorphology()
similar_morphemes = cross_lingual.analyze_cross_lingual_morphemes()

print(f"\nFound {len(similar_morphemes)} cross-lingual morpheme similarities")
print("\nTop 5 similarities:")
for m in similar_morphemes[:5]:
    print(f"  {m['lang1']}:{m['morph1']} ↔ {m['lang2']}:{m['morph2']} "
          f"(similarity: {m['similarity']:.3f})")
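
The example words above mix lowercase and capitalized forms (`arbeit`, `Arbeiter`), and case matters for trigram overlap because `_Ar` and `_ar` are different trigrams. The sketch below compares the Jaccard overlap with and without case folding. Whether and how to normalize case is a design choice; this sketch uses Python's `str.casefold()` as one option and makes no claim about how the paper treats capitalization.

```python
def trigram_set(word, fold_case=False):
    """Distinct trigrams of a padded word, optionally case-folded first."""
    if fold_case:
        word = word.casefold()
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard index of two sets, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


raw = jaccard(trigram_set("arbeit"), trigram_set("Arbeiter"))
folded = jaccard(trigram_set("arbeit", True), trigram_set("Arbeiter", True))

print(f"case-sensitive: {raw:.3f}")     # 3 shared / 11 distinct = 0.273
print(f"case-folded:    {folded:.3f}")  # 5 shared / 9 distinct = 0.556
```

Folding case recovers the two word-initial trigrams (`_ar`, `arb`) that capitalization had hidden, roughly doubling the measured overlap for this pair.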

## 5. Morphological Decomposition

Let's demonstrate how a word's trigrams can be grouped into prefix, root, and suffix components using a small set of hand-written affix patterns:

In [None]:
class MorphologicalDecomposer:
    """Decompose words into morphological components using trigrams."""
    
    def __init__(self):
        # Common affix patterns: the trigrams each affix contributes at the
        # word edge. Prefix patterns start at the left boundary marker and
        # suffix patterns end at the right one.
        self.patterns = {
            'prefixes': {
                'un': ['_un'],
                're': ['_re'],
                'pre': ['_pr', 'pre'],
                'dis': ['_di', 'dis'],
                'mis': ['_mi', 'mis']
            },
            'suffixes': {
                'ing': ['ing', 'ng_'],
                'ed': ['ed_'],
                'er': ['er_'],
                'est': ['est', 'st_'],
                'ly': ['ly_'],
                'ness': ['nes', 'ess', 'ss_'],
                'able': ['abl', 'ble', 'le_'],
                'tion': ['tio', 'ion', 'on_']
            },
            'roots': {}
        }
        
    def decompose_word(self, word: str) -> Dict[str, object]:
        """Decompose a word into morphological components.

        Heuristic: an affix is reported whenever its full trigram pattern
        sits at the word edge, so a word like 'ready' is still read as
        're' + root.
        """
        padded = f"_{word}_"
        trigrams = [padded[i:i+3] for i in range(len(padded) - 2)]
        
        decomposition = {
            'word': word,
            'trigrams': trigrams,
            'prefix': None,
            'root': None,
            'suffix': None,
            'prefix_trigrams': [],
            'root_trigrams': [],
            'suffix_trigrams': []
        }
        
        # Check for prefixes: the whole pattern must match the leading trigrams
        for prefix, prefix_trigrams in self.patterns['prefixes'].items():
            if trigrams[:len(prefix_trigrams)] == prefix_trigrams:
                decomposition['prefix'] = prefix
                decomposition['prefix_trigrams'] = list(prefix_trigrams)
                break
        
        # Check for suffixes: the whole pattern must match the trailing trigrams
        for suffix, suffix_trigrams in self.patterns['suffixes'].items():
            if trigrams[-len(suffix_trigrams):] == suffix_trigrams:
                decomposition['suffix'] = suffix
                decomposition['suffix_trigrams'] = list(suffix_trigrams)
                break
        
        # Remaining trigrams are likely the root
        prefix_len = len(decomposition['prefix_trigrams'])
        suffix_len = len(decomposition['suffix_trigrams'])
        
        if prefix_len + suffix_len < len(trigrams):
            decomposition['root_trigrams'] = trigrams[prefix_len:len(trigrams)-suffix_len]
            
        return decomposition
    
    def visualize_decomposition(self, words: List[str]):
        """Visualize morphological decomposition of words."""
        decompositions = [self.decompose_word(word) for word in words]
        
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))
        
        # 1. Trigram distribution by component
        component_counts = defaultdict(int)
        for d in decompositions:
            component_counts['Prefix'] += len(d['prefix_trigrams'])
            component_counts['Root'] += len(d['root_trigrams'])
            component_counts['Suffix'] += len(d['suffix_trigrams'])
        
        components = list(component_counts.keys())
        counts = list(component_counts.values())
        
        ax1.pie(counts, labels=components, autopct='%1.1f%%', 
               colors=['lightcoral', 'lightblue', 'lightgreen'])
        ax1.set_title('Trigram Distribution by Morphological Component', fontsize=14)
        
        # 2. Word decomposition visualization
        y_pos = 0
        colors = {'prefix': 'lightcoral', 'root': 'lightblue', 'suffix': 'lightgreen'}
        
        for d in decompositions:
            x_pos = 0
            
            # Draw word
            ax2.text(-0.5, y_pos, d['word'], fontsize=12, fontweight='bold', 
                    ha='right', va='center')
            
            # Draw trigrams with color coding
            for i, trigram in enumerate(d['trigrams']):
                if trigram in d['prefix_trigrams']:
                    color = colors['prefix']
                elif trigram in d['suffix_trigrams']:
                    color = colors['suffix']
                else:
                    color = colors['root']
                
                rect = plt.Rectangle((x_pos, y_pos - 0.3), 0.8, 0.6, 
                                   facecolor=color, edgecolor='black', linewidth=1)
                ax2.add_patch(rect)
                ax2.text(x_pos + 0.4, y_pos, trigram, fontsize=10, 
                        ha='center', va='center')
                x_pos += 0.9
            
            # Add component labels
            if d['prefix']:
                ax2.text(-0.5, y_pos - 0.5, f"Prefix: {d['prefix']}", 
                        fontsize=9, style='italic', ha='right')
            if d['suffix']:
                ax2.text(x_pos, y_pos - 0.5, f"Suffix: {d['suffix']}", 
                        fontsize=9, style='italic')
            
            y_pos -= 1.5
        
        ax2.set_xlim(-2, max(15, x_pos + 1))
        ax2.set_ylim(y_pos, 1)
        ax2.set_title('Morphological Decomposition via Trigrams', fontsize=14)
        ax2.axis('off')
        
        # Add legend
        for comp, color in colors.items():
            ax2.add_patch(plt.Rectangle((12, 0.5 - list(colors.keys()).index(comp) * 0.3), 
                                       0.3, 0.2, facecolor=color, edgecolor='black'))
            ax2.text(12.4, 0.6 - list(colors.keys()).index(comp) * 0.3, 
                    comp.capitalize(), fontsize=10)
        
        plt.tight_layout()
        plt.show()
        
        return decompositions


# Analyze morphological decomposition
decomposer = MorphologicalDecomposer()

# Test words with various morphological structures
test_words = [
    'unhappiness',
    'recomputing',
    'predictable',
    'miscommunication',
    'preprocessing',
    'worker',
    'strongest'
]

print("Morphological Decomposition Analysis:")
print("=" * 50)

decompositions = decomposer.visualize_decomposition(test_words)

# Print detailed analysis
for d in decompositions:
    print(f"\n{d['word']}:")
    if d['prefix']:
        print(f"  Prefix: {d['prefix']} (trigrams: {d['prefix_trigrams']})")
    print(f"  Root trigrams: {d['root_trigrams']}")
    if d['suffix']:
        print(f"  Suffix: {d['suffix']} (trigrams: {d['suffix_trigrams']})")
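
One thing the prefix/root/suffix coloring hides is that some trigrams straddle a morpheme boundary and so belong to no single component. The sketch below takes a word that has already been segmented by hand (the segmentation `un|happi|ness` is an assumption supplied for illustration, not something the trigrams discover) and reports which morphemes each trigram touches. `label_trigrams` is a hypothetical helper introduced only here.

```python
def label_trigrams(morphemes):
    """Pair each trigram of a pre-segmented word with the morpheme indices it touches."""
    padded = "_" + "".join(morphemes) + "_"

    # owner[i] is the index of the morpheme that character i of `padded` belongs to;
    # the two padding characters are attributed to the first and last morpheme.
    owner = [0]
    for idx, morpheme in enumerate(morphemes):
        owner.extend([idx] * len(morpheme))
    owner.append(len(morphemes) - 1)

    labels = []
    for i in range(len(padded) - 2):
        touched = sorted(set(owner[i:i + 3]))
        labels.append((padded[i:i + 3], touched))
    return labels


labels = label_trigrams(["un", "happi", "ness"])
straddling = [trigram for trigram, touched in labels if len(touched) > 1]

print(straddling)                                  # ['unh', 'nha', 'pin', 'ine']
print(f"{len(straddling)} of {len(labels)} trigrams cross a boundary")  # 4 of 11
```

These boundary trigrams are specific to the combination of morphemes rather than to either morpheme alone, which is one reason the overlap between a base word and its affixed form is high but never complete.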

## 6. Morphological Generalization

Let's demonstrate how T-FREE generalizes to unseen morphological combinations:

In [None]:
class MorphologicalGeneralization:
    """Test T-FREE's ability to generalize morphological patterns."""
    
    def __init__(self, vocab_size: int = 8000, embed_dim: int = 128):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.trigram_embeddings = {}
        self.word_embeddings = {}
        
    def train_on_words(self, training_words: List[str]):
        """Initialize embeddings from training words."""
        # Extract all trigrams
        all_trigrams = set()
        for word in training_words:
            padded = f"_{word}_"
            for i in range(len(padded) - 2):
                all_trigrams.add(padded[i:i+3])
        
        # Initialize trigram embeddings
        for trigram in all_trigrams:
            self.trigram_embeddings[trigram] = np.random.randn(self.embed_dim) * 0.1
        
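        # Store word embeddings (get_embedding also returns a trigram count,
        # which is not needed here)
        for word in training_words:
            self.word_embeddings[word], _ = self.get_embedding(word)
    
    def get_embedding(self, word: str) -> Tuple[np.ndarray, int]:
        """Get the T-FREE embedding for a word and its number of known trigrams."""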
        padded = f"_{word}_"
        embedding = np.zeros(self.embed_dim)
        known_trigrams = 0
        
        for i in range(len(padded) - 2):
            trigram = padded[i:i+3]
            if trigram in self.trigram_embeddings:
                embedding += self.trigram_embeddings[trigram]
                known_trigrams += 1
        
        # Normalize
        if np.linalg.norm(embedding) > 0:
            embedding = embedding / np.linalg.norm(embedding)
            
        return embedding, known_trigrams
    
    def test_generalization(self):
        """Test generalization to unseen morphological combinations."""
        # Training words (seen combinations)
        training_words = [
            # Base forms
            'compute', 'program', 'develop', 'analyze',
            # With -ing
            'computing', 'programming',
            # With -er
            'computer', 'programmer',
            # With un-
            'unpack', 'undo',
            # With -able
            'readable', 'writable'
        ]
        
        self.train_on_words(training_words)
        
        # Test words (unseen combinations)
        test_cases = [
            ('developing', 'Seen root + seen suffix'),
            ('developer', 'Seen root + seen suffix'),
            ('analyzer', 'Seen root + seen suffix'),
            ('uncomputable', 'Seen prefix + seen root + seen suffix'),
            ('reprogrammable', 'Unseen prefix + seen root + seen suffix'),
            ('unanalyzable', 'Seen prefix + seen root + seen suffix'),
            ('preprocessor', 'Unseen prefix + unseen root + unseen suffix')
        ]
        
        results = []
        
        for test_word, description in test_cases:
            embedding, known_trigrams = self.get_embedding(test_word)
            padded = f"_{test_word}_"
            total_trigrams = len(padded) - 2
            coverage = known_trigrams / total_trigrams if total_trigrams > 0 else 0
            
            # Find most similar training word
            similarities = {}
            if known_trigrams > 0:
                for train_word, train_embed in self.word_embeddings.items():
                    similarities[train_word] = float(np.dot(embedding, train_embed))
            
            # Fall back to a placeholder when no trigram of the test word is known
            most_similar = max(similarities.items(), key=lambda x: x[1],
                               default=(None, 0.0))
            
            results.append({
                'word': test_word,
                'description': description,
                'coverage': coverage,
                'most_similar': most_similar[0],
                'similarity': most_similar[1]
            })
        
        # Visualization
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
        
        # Coverage plot
        words = [r['word'] for r in results]
        coverages = [r['coverage'] * 100 for r in results]
        
        bars = ax1.bar(range(len(words)), coverages, color='steelblue', alpha=0.8)
        ax1.set_xlabel('Test Word', fontsize=12)
        ax1.set_ylabel('Trigram Coverage (%)', fontsize=12)
        ax1.set_title('Known Trigram Coverage for Unseen Words', fontsize=14)
        ax1.set_xticks(range(len(words)))
        ax1.set_xticklabels(words, rotation=45, ha='right')
        ax1.axhline(y=50, color='red', linestyle='--', alpha=0.5, label='50% threshold')
        ax1.legend()
        ax1.grid(axis='y', alpha=0.3)
        
        # Add value labels
        for bar, coverage in zip(bars, coverages):
            ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                    f'{coverage:.0f}%', ha='center', va='bottom')
        
        # Similarity plot
        similarities = [r['similarity'] for r in results]
        similar_words = [r['most_similar'] for r in results]
        
        y_pos = np.arange(len(words))
        ax2.barh(y_pos, similarities, color='darkorange', alpha=0.8)
        ax2.set_yticks(y_pos)
        ax2.set_yticklabels([f"{w}\n→ {s}" for w, s in zip(words, similar_words)], 
                           fontsize=10)
        ax2.set_xlabel('Cosine Similarity', fontsize=12)
        ax2.set_title('Similarity to Nearest Training Word', fontsize=14)
        ax2.grid(axis='x', alpha=0.3)
        
        # Add value labels
        for i, sim in enumerate(similarities):
            ax2.text(sim + 0.01, i, f'{sim:.3f}', va='center')
        
        plt.tight_layout()
        plt.show()
        
        # Print detailed results
        print("\nGeneralization Test Results:")
        print("=" * 70)
        for r in results:
            print(f"{r['word']:20} | {r['description']:35} | "
                  f"Coverage: {r['coverage']*100:.0f}% | "
                  f"Similar to: {r['most_similar']:15} ({r['similarity']:.3f})")
        
        return results


# Test morphological generalization
generalizer = MorphologicalGeneralization()
results = generalizer.test_generalization()
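
The coverage metric used in the test above can also be checked in isolation. The following is a minimal sketch, assuming the same `_word_` padding as the notebook; the helper names `trigrams` and `trigram_coverage` are illustrative and not part of the class above.

```python
def trigrams(word):
    """Return the character trigrams of a word padded with '_' on both sides."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_coverage(word, training_words):
    """Fraction of the word's trigrams that occur in any training word."""
    known = {t for w in training_words for t in trigrams(w)}
    grams = trigrams(word)
    if not grams:  # an empty input yields no trigrams
        return 0.0
    return sum(t in known for t in grams) / len(grams)


# 'developing' was never seen, but 'develop' and 'computing' were.
print(trigram_coverage("developing", ["develop", "computing"]))  # 0.8
```

Only the two trigrams that straddle the stem/suffix boundary (`opi` and `pin`) are unknown here, which is why coverage stays high for unseen combinations of familiar parts.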

## 7. Computational Efficiency of Morphological Processing

Let's analyze the computational benefits of T-FREE's morphological approach:

In [None]:
class MorphologicalEfficiency:
    """Analyze computational efficiency of morphological processing."""
    
    def __init__(self):
        self.word_families = {
            'compute': ['compute', 'computes', 'computed', 'computing', 'computer',
                       'computational', 'computation', 'computable', 'recompute'],
            'develop': ['develop', 'develops', 'developed', 'developing', 'developer',
                       'development', 'developmental', 'redevelop', 'underdeveloped'],
            'analyze': ['analyze', 'analyzes', 'analyzed', 'analyzing', 'analyzer',
                       'analysis', 'analytical', 'reanalyze', 'unanalyzable']
        }
        
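    def calculate_parameter_savings(self):
        """Compare embedding-row counts: one row per word vs. one row per unique trigram.

        Counts are rows (vectors), not scalar parameters; multiply by the
        embedding dimension to get parameters.
        """
        # Traditional approach: each word form gets its own embedding row
        traditional_params = sum(len(family) for family in self.word_families.values())
        
        # T-FREE approach (simplified model): one row per distinct trigram,
        # shared across all words that contain it
        all_trigrams = set()
        for family in self.word_families.values():
            for word in family:
                padded = f"_{word}_"
                for i in range(len(padded) - 2):
                    all_trigrams.add(padded[i:i+3])
        
        tfree_params = len(all_trigrams)
        
        # Relative savings. For a word list this small, the number of distinct
        # trigrams can exceed the number of words, making this value negative;
        # sharing pays off only once the vocabulary is much larger than the
        # set of distinct trigrams.
        savings = (traditional_params - tfree_params) / traditional_params * 100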
        
        return {
            'traditional_params': traditional_params,
            'tfree_params': tfree_params,
            'savings_percent': savings,
            'trigrams': all_trigrams
        }
    
    def visualize_efficiency(self):
        """Visualize efficiency gains from morphological sharing."""
        savings = self.calculate_parameter_savings()
        
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
        
        # 1. Parameter comparison
        methods = ['Traditional\nTokenizer', 'T-FREE']
        params = [savings['traditional_params'], savings['tfree_params']]
        
        bars = ax1.bar(methods, params, color=['coral', 'seagreen'], alpha=0.8)
        ax1.set_ylabel('Number of Embedding Rows', fontsize=12)
        ax1.set_title('Parameter Requirements for Morphological Families', fontsize=14)
        
        # Add value labels and savings
        for bar, param in zip(bars, params):
            ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                    f'{param}', ha='center', va='bottom', fontsize=12)
        
        # Add savings annotation
        ax1.annotate(f"Savings: {savings['savings_percent']:.1f}%",
                    xy=(0.5, params[1]), xytext=(0.5, (params[0] + params[1])/2),
                    arrowprops=dict(arrowstyle='->', color='red', lw=2),
                    fontsize=14, ha='center', color='red')
        
        # 2. Trigram sharing heatmap
        families = list(self.word_families.keys())
        n_families = len(families)
        sharing_matrix = np.zeros((n_families, n_families))
        
        # Calculate trigram sharing between families
        family_trigrams = {}
        for root, words in self.word_families.items():
            trigrams = set()
            for word in words:
                padded = f"_{word}_"
                for j in range(len(padded) - 2):
                    trigrams.add(padded[j:j+3])
            family_trigrams[root] = trigrams
        
        for i, fam1 in enumerate(families):
            for j, fam2 in enumerate(families):
                if family_trigrams[fam1] | family_trigrams[fam2]:
                    sharing = len(family_trigrams[fam1] & family_trigrams[fam2]) / \
                             len(family_trigrams[fam1] | family_trigrams[fam2])
                    sharing_matrix[i, j] = sharing
        
        sns.heatmap(sharing_matrix, annot=True, fmt='.2f', cmap='YlOrRd',
                   xticklabels=families, yticklabels=families, ax=ax2)
        ax2.set_title('Trigram Sharing Between Word Families', fontsize=14)
        
        # 3. Memory usage comparison
        # Assuming 4 bytes per parameter
        embed_dim = 768  # Typical embedding dimension
        traditional_memory = savings['traditional_params'] * embed_dim * 4 / 1024  # KB
        tfree_memory = savings['tfree_params'] * embed_dim * 4 / 1024  # KB
        
        memory_data = [traditional_memory, tfree_memory]
        ax3.pie(memory_data, labels=[f'Traditional\n{traditional_memory:.1f} KB',
                                     f'T-FREE\n{tfree_memory:.1f} KB'],
               colors=['coral', 'seagreen'], autopct='%1.1f%%',
               startangle=90)
        ax3.set_title('Memory Usage Comparison\n(768-dim embeddings)', fontsize=14)
        
        # 4. Scaling analysis
        vocab_sizes = [1000, 5000, 10000, 50000, 100000]
        traditional_scaling = vocab_sizes
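        # Illustrative assumption, not a measurement: T-FREE is drawn at a
        # constant 0.3x of the traditional count. This curve is still linear;
        # real trigram counts depend on the corpus and flatten as the finite
        # set of possible trigrams is exhausted.
        tfree_scaling = [v * 0.3 for v in vocab_sizes]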
        
        ax4.plot(vocab_sizes, traditional_scaling, 'o-', label='Traditional',
                color='coral', linewidth=2, markersize=8)
        ax4.plot(vocab_sizes, tfree_scaling, 's-', label='T-FREE (assumed 0.3x)',
                color='seagreen', linewidth=2, markersize=8)
        
        ax4.set_xlabel('Vocabulary Size', fontsize=12)
        ax4.set_ylabel('Embedding Parameters', fontsize=12)
        ax4.set_title('Scaling with Vocabulary Size', fontsize=14)
        ax4.set_xscale('log')
        ax4.set_yscale('log')
        ax4.legend(fontsize=12)
        ax4.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        return savings


# Analyze efficiency
efficiency = MorphologicalEfficiency()
savings = efficiency.visualize_efficiency()

print("\nEfficiency Summary:")
print(f"Traditional approach: {savings['traditional_params']} embedding rows")
print(f"T-FREE approach: {savings['tfree_params']} embedding rows")
print(f"Parameter reduction: {savings['savings_percent']:.1f}%")
if savings['savings_percent'] > 0:
    print("\nTrigram sharing reduces the number of embedding rows for these families.")
else:
    print("\nFor a word list this small, distinct trigrams outnumber words;")
    print("sharing pays off only at much larger vocabulary sizes.")
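
How much sharing a word family provides can be seen by counting the trigrams each word adds beyond those already contributed by earlier family members. The following is a minimal sketch that reuses the notebook's padding convention; `trigram_set` and `new_trigrams_per_word` are hypothetical helpers introduced for illustration.

```python
def trigram_set(word):
    """Distinct character trigrams of a word padded with '_' on both sides."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def new_trigrams_per_word(words):
    """For each word in order, count trigrams not contributed by earlier words."""
    seen = set()
    counts = []
    for word in words:
        grams = trigram_set(word)
        counts.append(len(grams - seen))
        seen |= grams
    return counts


family = ['compute', 'computes', 'computed', 'computing', 'computer',
          'computational', 'computation', 'computable', 'recompute']
counts = new_trigrams_per_word(family)
for word, n in zip(family, counts):
    print(f"{word:15} adds {n} new trigram(s)")
print(f"{len(family)} words, {sum(counts)} distinct trigrams")
```

Inflected forms such as `computes` add only a couple of trigrams, yet the family as a whole still needs more trigram rows than it has words. This is why the savings depend on scale rather than appearing in a handful of words.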

## Summary and Key Insights

### 1. **Automatic Morphological Understanding**
- T-FREE captures morphological relationships through shared trigrams
- No explicit morphological rules or separate per-variant embeddings are needed
- Words with similar morphology naturally share representations

### 2. **Parameter Efficiency**
- Morphological families share trigram parameters
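- Savings appear once the vocabulary is much larger than the set of distinct trigrams; the 27-word toy example above is too small to show a reduction
- The set of possible trigrams is finite, so the trigram table grows sub-linearly with vocabulary size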
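
The paper describes embedding words as sparse activation patterns over hashed character triplets. A hashed table has a fixed number of rows no matter how many word forms occur. The sketch below illustrates the idea only: the table size of 8000 rows, the single SHA-256-based hash, and the function names are assumptions made for illustration, not the paper's actual configuration.

```python
import hashlib

TABLE_SIZE = 8000  # assumed number of embedding rows (illustrative only)


def trigram_set(word):
    """Distinct character trigrams of a word padded with '_' on both sides."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def hashed_rows(word, table_size=TABLE_SIZE):
    """Map each trigram to a row index in a fixed-size table via a stable hash."""
    rows = set()
    for gram in trigram_set(word):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        rows.add(int.from_bytes(digest[:8], "big") % table_size)
    return rows


a, b = hashed_rows("compute"), hashed_rows("computer")
print(f"shared rows: {len(a & b)} of {len(a | b)}")
```

Adding new word forms changes which rows are activated but not how many rows exist. The trade-off is that unrelated trigrams can collide on the same row.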

### 3. **Zero-Shot Morphology**
- Handles unseen combinations of known morphemes
- Trigram coverage stays high for novel words built from familiar parts
- Generalizes learned patterns to new formations

### 4. **Cross-Lingual Benefits**
- Morphological patterns can transfer across languages that share a script
- Cognates and borrowings share trigrams
- Affixes common to related languages map to the same trigrams
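
Trigram overlap between cognates can be measured with a Jaccard index, the same ratio the sharing heatmap uses for word families. The following is a minimal sketch; the word pairs are illustrative examples chosen here, not data from the paper, and the helper names are assumptions.

```python
def trigram_set(word):
    """Distinct character trigrams of a word padded with '_' on both sides."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(word_a, word_b):
    """Size of the trigram intersection divided by the size of the union."""
    a, b = trigram_set(word_a), trigram_set(word_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


pairs = [("problem", "problema"),  # English / Spanish cognates
         ("nation", "nation"),     # English / French, identical spelling
         ("water", "agua")]        # same meaning, unrelated spelling
for a, b in pairs:
    print(f"{a:10} {b:10} {jaccard(a, b):.2f}")
```

The overlap reflects spelling rather than meaning: `water` and `agua` share no trigrams even though they are translations of each other.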

### 5. **Compositional Representations**
- Words decompose into overlapping trigrams that span prefixes, roots, and suffixes
- Affixes appear as recurring trigram patterns rather than as explicitly identified morphemes
- A word's representation is composed from the embeddings of its trigrams
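
One way to see which trigrams an affix contributes is to compare the trigram set of a derived word with that of its base. The following is a minimal sketch under the notebook's padding convention; `split_trigrams` is a hypothetical helper, not part of T-FREE.

```python
def trigram_set(word):
    """Distinct character trigrams of a word padded with '_' on both sides."""
    padded = f"_{word}_"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def split_trigrams(base, derived):
    """Return (shared, added) trigrams of a derived form relative to its base."""
    base_set, derived_set = trigram_set(base), trigram_set(derived)
    return base_set & derived_set, derived_set - base_set


shared, added = split_trigrams("pack", "unpack")
print("shared:", sorted(shared))
print("added: ", sorted(added))
```

The base's own boundary trigram `_pa` does not occur in `unpack`, so this comparison is an approximation of morpheme structure rather than a clean split.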

### 6. **Computational Advantages**
- Reduced memory footprint
- Efficient parameter sharing
- Better scaling properties than traditional approaches

## References

Deiseroth, B., Brack, M., Schramowski, P., Kersting, K., & Weinbach, S. (2025). T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings. *arXiv preprint arXiv:2406.19223v2*.