# Computational Linguistics: Statistical Analysis of Natural Language

## Introduction

This notebook explores fundamental concepts in **statistical linguistics** and **quantitative language analysis**. We'll analyze a multilingual dataset containing parallel content across four languages to uncover universal patterns and language-specific characteristics.

### Dataset Overview

Our dataset contains text samples from four languages:
- **English** (Germanic, Latin script)
- **Spanish** (Romance, Latin script)
- **German** (Germanic, Latin script)
- **Japanese** (Japonic, mixed script - Kanji, Hiragana, Katakana)

### Statistical Linguistics Methods

We'll explore several key concepts:

1. **Zipf's Law**: A fundamental observation that word frequency is inversely proportional to rank (frequency ∝ 1/rank)
2. **Lexical Diversity**: Measures of vocabulary richness including Type-Token Ratio (TTR)
3. **Power Law Distribution**: The heavy-tailed form `f(r) ∝ r^(-α)` that word frequency distributions approximately follow
4. **N-gram Analysis**: Patterns in consecutive word sequences
5. **Cross-linguistic Comparison**: Universal vs language-specific statistical properties

These methods form the foundation of computational linguistics, corpus linguistics, and natural language processing.

## Setup: Import Required Libraries

We'll use standard scientific Python libraries along with NLTK for natural language processing.

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats

# Natural language processing
import nltk
import re
from collections import Counter

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

print("All libraries imported successfully!")

## Load and Explore the Multilingual Dataset

First, we'll load the CSV file containing texts from four languages and examine its structure.

In [None]:
# Load the multilingual dataset
df = pd.read_csv('../data/multilingual_texts.csv')

# Display basic information
print("Dataset shape:", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nLanguages in dataset:", df['language'].unique())
print("\nRows per language:")
print(df['language'].value_counts())

# Display first few rows
print("\nFirst few rows:")
df.head()

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Basic text statistics
df['text_length'] = df['text'].str.len()
print("\nText length statistics by language:")
df.groupby('language')['text_length'].describe()

## Tokenization: Breaking Text into Words

**Tokenization** is the process of splitting text into individual units (tokens), typically words. This is the foundation for most linguistic analysis.

### Challenges:
- **Latin scripts** (English, Spanish, German): Words separated by spaces
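- **Japanese**: No whitespace between words, and text mixes three scripts (Kanji, Hiragana, Katakana)

We'll use regular expressions for Latin scripts and character-based tokenization for Japanese. The character-based approach is a simplification: Japanese "tokens" in this notebook are single characters, not words.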
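
As a side illustration of the regular-expression approach, the sketch below uses a Unicode-aware pattern instead of an explicit list of accented letters. `WORD_RE` and `tokenize_latin` are hypothetical names introduced here; they are not part of the notebook's pipeline, and the pattern is only one of several reasonable choices.

```python
import re

# Hypothetical helper, not part of the notebook's pipeline.
# [^\W\d_]+ matches runs of Unicode letters: \w minus digits and underscore.
WORD_RE = re.compile(r"[^\W\d_]+")

def tokenize_latin(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return WORD_RE.findall(text.lower())

print(tokenize_latin("Die Straße ist schön."))
print(tokenize_latin("¿Dónde está el niño?"))
```

Note that this pattern splits on apostrophes and hyphens, so "don't" becomes two tokens; whether that is acceptable depends on the analysis.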

In [None]:
import unicodedata

def tokenize_text(text, language='english'):
    """
    Tokenize text into words, handling different scripts.
    
    For Latin scripts: lowercase, strip non-letters, split on whitespace
    For Japanese: use character-based tokenization (simplified)
    """
    if language.lower() == 'japanese':
        # Simplified: treat each character as a token.
        # In production, use a morphological analyzer such as MeCab or SudachiPy.
        # Python's built-in re module does not support \p{P}, so punctuation
        # is detected through its Unicode category (P*) instead.
        tokens = [char for char in text
                  if not char.isspace()
                  and not unicodedata.category(char).startswith('P')]
    else:
        # For Latin scripts: lowercase, then keep only letters and whitespace
        text = text.lower()
        # Keep Spanish (áéíóúñü) and German (äöß) letters as well as a-z
        text = re.sub(r'[^a-záéíóúñüäöß\s]', '', text)
        tokens = text.split()
    
    return [t for t in tokens if t]  # Remove empty strings

# Apply tokenization to each language
df['tokens'] = df.apply(lambda row: tokenize_text(row['text'], row['language']), axis=1)
df['token_count'] = df['tokens'].apply(len)

# Display results
print("Token count statistics by language:")
print(df.groupby('language')['token_count'].describe())

# Show example tokenization
print("\nExample tokenization for each language:")
for lang in df['language'].unique():
    sample = df[df['language'] == lang].iloc[0]
    print(f"\n{lang}:")
    print(f"Original: {sample['text'][:100]}...")
    print(f"Tokens (first 10): {sample['tokens'][:10]}")

## Word Frequency Analysis

**Word frequency** is a fundamental measure in linguistics. The most frequent words in any language tend to be **function words** (articles, prepositions, conjunctions) while **content words** (nouns, verbs, adjectives) are less frequent but carry more semantic meaning.

In [None]:
# Calculate word frequencies for each language
language_frequencies = {}

for lang in df['language'].unique():
    # Combine all tokens for this language
    all_tokens = []
    for tokens in df[df['language'] == lang]['tokens']:
        all_tokens.extend(tokens)
    
    # Count frequencies
    freq_counter = Counter(all_tokens)
    language_frequencies[lang] = freq_counter
    
    print(f"\n{lang.upper()}:")
    print(f"Total tokens: {len(all_tokens):,}")
    print(f"Unique tokens (vocabulary size): {len(freq_counter):,}")
    print(f"\nTop 10 most frequent words:")
    for word, count in freq_counter.most_common(10):
        print(f"  {word}: {count}")

In [None]:
# Visualize top words per language
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for idx, lang in enumerate(sorted(df['language'].unique())):
    freq_counter = language_frequencies[lang]
    top_words = freq_counter.most_common(15)
    words, counts = zip(*top_words)
    
    axes[idx].barh(range(len(words)), counts, color=sns.color_palette("husl", 4)[idx])
    axes[idx].set_yticks(range(len(words)))
    axes[idx].set_yticklabels(words)
    axes[idx].invert_yaxis()
    axes[idx].set_xlabel('Frequency', fontsize=12)
    axes[idx].set_title(f'Top 15 Words in {lang}', fontsize=14, fontweight='bold')
    axes[idx].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('word_frequencies.png', dpi=300, bbox_inches='tight')
plt.show()

print("Note: High-frequency words are typically function words (articles, prepositions, pronouns)")

## Zipf's Law: The Power Law of Language

**Zipf's Law** is one of the most remarkable discoveries in quantitative linguistics. It states that:

> The frequency of any word is inversely proportional to its rank in the frequency table.

Mathematically: `frequency ∝ 1/rank^α` where α ≈ 1

On a log-log plot, this appears as an approximately **straight line** with slope −α, the signature of a **power law distribution**.

### Implications:
- A few words are extremely common
- Most words are rare (**hapax legomena**: words appearing only once)
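- This pattern has been observed in a wide range of human languages, although real corpora deviate from the ideal line at the highest and lowest ranks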
- Important for compression, information theory, and NLP
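
To make the first implication concrete, the following sketch computes how much of a text the top-ranked words would cover under an *ideal* Zipf distribution. The function name `zipf_coverage` and the vocabulary size of 50,000 are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

def zipf_coverage(top_k, vocab_size, alpha=1.0):
    """Share of all tokens covered by the top_k ranks under an ideal Zipf distribution."""
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / ranks ** alpha   # unnormalized frequency at each rank
    return weights[:top_k].sum() / weights.sum()

# Illustrative vocabulary size (an assumption, not measured from the data)
for k in (10, 100, 1000):
    print(f"top {k:>4} of 50,000 types: {zipf_coverage(k, 50_000):.1%}")
```

With α = 1 and this vocabulary size, the ten most frequent types alone account for about a quarter of all tokens in the idealized model.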

In [None]:
# Calculate rank-frequency data for each language
zipf_data = {}

for lang in df['language'].unique():
    freq_counter = language_frequencies[lang]
    
    # Get frequencies in descending order
    frequencies = [count for word, count in freq_counter.most_common()]
    ranks = np.arange(1, len(frequencies) + 1)
    
    zipf_data[lang] = {'ranks': ranks, 'frequencies': frequencies}
    
    # Calculate theoretical Zipf distribution
    # frequency = C / rank (where C is a constant)
    C = frequencies[0]  # Frequency of most common word
    theoretical_freq = C / ranks
    zipf_data[lang]['theoretical'] = theoretical_freq
    
    # Calculate correlation in log space
    log_ranks = np.log10(ranks)
    log_freqs = np.log10(frequencies)
    correlation = np.corrcoef(log_ranks, log_freqs)[0, 1]
    
    print(f"{lang}: Zipf correlation (log-log) = {correlation:.4f}")

In [None]:
# Plot Zipf's Law on log-log scale
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()
colors = sns.color_palette("husl", 4)

for idx, lang in enumerate(sorted(df['language'].unique())):
    ranks = zipf_data[lang]['ranks']
    frequencies = zipf_data[lang]['frequencies']
    theoretical = zipf_data[lang]['theoretical']
    
    # Plot observed frequencies
    axes[idx].loglog(ranks, frequencies, 'o', alpha=0.6, markersize=4, 
                     color=colors[idx], label='Observed')
    
    # Plot theoretical Zipf distribution
    axes[idx].loglog(ranks, theoretical, '--', linewidth=2, 
                     color='red', label='Theoretical Zipf (1/rank)')
    
    axes[idx].set_xlabel('Rank (log scale)', fontsize=12)
    axes[idx].set_ylabel('Frequency (log scale)', fontsize=12)
    axes[idx].set_title(f"Zipf's Law: {lang}", fontsize=14, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('zipfs_law.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nZipf's Law demonstrates a power law: straight line on log-log plot")
print("This pattern has been observed across a wide range of human languages.")

In [None]:
# Calculate Zipf exponent (slope) for each language using linear regression in log space
print("Zipf's Law Exponent Analysis:")
print("\nTheoretical Zipf exponent α = 1.0")
print("Actual values often range from 0.8 to 1.2\n")

for lang in sorted(df['language'].unique()):
    ranks = zipf_data[lang]['ranks']
    frequencies = zipf_data[lang]['frequencies']
    
    # Linear regression in log space
    log_ranks = np.log10(ranks)
    log_freqs = np.log10(frequencies)
    
    slope, intercept, r_value, p_value, std_err = stats.linregress(log_ranks, log_freqs)
    
    print(f"{lang}:")
    print(f"  Exponent α: {-slope:.3f}")
    print(f"  R² (fit quality): {r_value**2:.4f}")
    print(f"  Equation: frequency = 10^{intercept:.2f} × rank^{slope:.2f}\n")

## Lexical Diversity: Measuring Vocabulary Richness

**Lexical diversity** (or vocabulary richness) measures how varied the vocabulary is in a text.

### Type-Token Ratio (TTR):
```
TTR = (Number of unique words) / (Total number of words)
TTR = Types / Tokens
```

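- **High TTR** (closer to 1.0): Rich, varied vocabulary
- **Low TTR** (closer to 0): Repetitive, limited vocabulary
- **Caveat**: TTR falls as a text gets longer, because new tokens increasingly repeat known types, so compare TTR only between samples of similar size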

### Related Concepts:
- **Types**: Unique words in a text
- **Tokens**: Total words (including repetitions)
- **Hapax legomena**: Words appearing exactly once
- **Hapax dislegomena**: Words appearing exactly twice
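
One common way to reduce TTR's dependence on text length is the moving-average TTR (MATTR), which averages TTR over fixed-size windows. It is not used elsewhere in this notebook; the sketch below, with the hypothetical helper `mattr` and a toy sentence, only illustrates the idea.

```python
def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every window of `window` tokens."""
    if len(tokens) < window:
        # Fall back to plain TTR when the text is shorter than one window
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

short = "the cat sat on the mat".split()
repeated = short * 20  # same vocabulary, 20 times the length

# Plain TTR drops sharply for the longer text...
print(len(set(short)) / len(short), len(set(repeated)) / len(repeated))
# ...while MATTR with a 6-token window is the same for both
print(mattr(short, window=6), mattr(repeated, window=6))
```

The window size is a free parameter; larger windows give lower values, so the same window must be used for every text being compared.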

In [None]:
# Calculate lexical diversity metrics for each language
lexical_diversity = {}

for lang in df['language'].unique():
    freq_counter = language_frequencies[lang]
    
    # Basic counts
    total_tokens = sum(freq_counter.values())
    unique_types = len(freq_counter)
    
    # Type-Token Ratio
    ttr = unique_types / total_tokens
    
    # Hapax legomena (words appearing once)
    hapax = sum(1 for count in freq_counter.values() if count == 1)
    hapax_percentage = (hapax / unique_types) * 100
    
    # Hapax dislegomena (words appearing twice)
    dis_legomena = sum(1 for count in freq_counter.values() if count == 2)
    
    # Store results
    lexical_diversity[lang] = {
        'tokens': total_tokens,
        'types': unique_types,
        'ttr': ttr,
        'hapax': hapax,
        'hapax_pct': hapax_percentage,
        'dis_legomena': dis_legomena
    }
    
    print(f"{lang.upper()}:")
    print(f"  Total tokens: {total_tokens:,}")
    print(f"  Unique types: {unique_types:,}")
    print(f"  Type-Token Ratio: {ttr:.4f}")
    print(f"  Hapax legomena: {hapax:,} ({hapax_percentage:.1f}% of vocabulary)")
    print(f"  Hapax dislegomena: {dis_legomena:,}\n")

In [None]:
# Visualize lexical diversity metrics
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

languages = sorted(df['language'].unique())
colors = sns.color_palette("husl", len(languages))

# Plot 1: Type-Token Ratio
ttrs = [lexical_diversity[lang]['ttr'] for lang in languages]
axes[0].bar(languages, ttrs, color=colors)
axes[0].set_ylabel('Type-Token Ratio', fontsize=12)
axes[0].set_title('Lexical Diversity (TTR)\nHigher = More Varied Vocabulary', 
                  fontsize=14, fontweight='bold')
axes[0].set_ylim(0, max(ttrs) * 1.2)
for i, v in enumerate(ttrs):
    axes[0].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

# Plot 2: Vocabulary size
vocab_sizes = [lexical_diversity[lang]['types'] for lang in languages]
axes[1].bar(languages, vocab_sizes, color=colors)
axes[1].set_ylabel('Vocabulary Size (Types)', fontsize=12)
axes[1].set_title('Total Unique Words', fontsize=14, fontweight='bold')
for i, v in enumerate(vocab_sizes):
    axes[1].text(i, v + max(vocab_sizes)*0.02, f'{v:,}', ha='center', va='bottom', fontweight='bold')

# Plot 3: Hapax legomena percentage
hapax_pcts = [lexical_diversity[lang]['hapax_pct'] for lang in languages]
axes[2].bar(languages, hapax_pcts, color=colors)
axes[2].set_ylabel('Hapax Legomena (%)', fontsize=12)
axes[2].set_title('Words Appearing Only Once\n(% of Vocabulary)', 
                  fontsize=14, fontweight='bold')
for i, v in enumerate(hapax_pcts):
    axes[2].text(i, v + 1, f'{v:.1f}%', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig('lexical_diversity.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nInterpretation:")
print("- Higher TTR indicates more diverse vocabulary usage")
print("- Hapax legomena are important for vocabulary growth and creativity")
print("- Typically 30-50% of the distinct word types in a corpus appear only once")

## N-gram Analysis: Discovering Common Phrases

**N-grams** are contiguous sequences of n items (words) from a text.

- **Unigram** (1-gram): Single word
- **Bigram** (2-gram): Two consecutive words
- **Trigram** (3-gram): Three consecutive words

### Applications:
- Identify common phrases and collocations
- Language modeling and prediction
- Text generation
- Statistical machine translation
- Phrase detection in search engines
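
As an illustration of the language-modeling application, the sketch below estimates bigram probabilities P(next | previous) by maximum likelihood from a toy token list. The helper `bigram_probabilities` and the example sentence are hypothetical, and a realistic model would also need smoothing for unseen pairs.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Maximum-likelihood estimate of P(next | previous) from one token list."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # each token counted as a left context
    return {
        (prev, nxt): count / context_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }

toy_tokens = "the cat sat on the mat and the cat slept".split()
probs = bigram_probabilities(toy_tokens)
print(probs[("the", "cat")])  # "the" is followed by "cat" in 2 of its 3 occurrences
print(probs[("the", "mat")])  # and by "mat" in 1 of 3
```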

In [None]:
def extract_ngrams(tokens, n=2):
    """
    Extract n-grams from a list of tokens.
    """
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngram = ' '.join(tokens[i:i+n])
        ngrams.append(ngram)
    return ngrams

# Calculate bigrams and trigrams for each language
ngram_data = {}

for lang in df['language'].unique():
    # Extract n-grams one document at a time so that no n-gram
    # spans the boundary between two unrelated texts
    bigrams, trigrams = [], []
    for tokens in df[df['language'] == lang]['tokens']:
        bigrams.extend(extract_ngrams(tokens, n=2))
        trigrams.extend(extract_ngrams(tokens, n=3))
    
    # Count frequencies
    bigram_counter = Counter(bigrams)
    trigram_counter = Counter(trigrams)
    
    ngram_data[lang] = {
        'bigrams': bigram_counter,
        'trigrams': trigram_counter
    }
    
    print(f"{lang.upper()} - Most Common Phrases:")
    print("\nTop 10 Bigrams (2-word phrases):")
    for ngram, count in bigram_counter.most_common(10):
        print(f"  '{ngram}': {count}")
    
    print("\nTop 10 Trigrams (3-word phrases):")
    for ngram, count in trigram_counter.most_common(10):
        print(f"  '{ngram}': {count}")
    print("\n" + "="*60 + "\n")

In [None]:
# Visualize most common bigrams
fig, axes = plt.subplots(2, 2, figsize=(18, 14))
axes = axes.flatten()

for idx, lang in enumerate(sorted(df['language'].unique())):
    bigram_counter = ngram_data[lang]['bigrams']
    top_bigrams = bigram_counter.most_common(12)
    
    if top_bigrams:
        phrases, counts = zip(*top_bigrams)
        
        axes[idx].barh(range(len(phrases)), counts, color=sns.color_palette("husl", 4)[idx])
        axes[idx].set_yticks(range(len(phrases)))
        axes[idx].set_yticklabels(phrases, fontsize=10)
        axes[idx].invert_yaxis()
        axes[idx].set_xlabel('Frequency', fontsize=12)
        axes[idx].set_title(f'Top Bigrams: {lang}', fontsize=14, fontweight='bold')
        axes[idx].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('bigrams.png', dpi=300, bbox_inches='tight')
plt.show()

print("Note: Common bigrams often include function words (articles, prepositions)")
print("These reveal grammatical patterns specific to each language")

## Cross-linguistic Comparison

Now we'll compare statistical properties across all four languages to identify:
1. **Universal patterns** (similar across all languages)
2. **Language-specific characteristics** (unique to each language)

We'll examine:
- Average word length (always 1 for Japanese here, because its tokens are single characters)
- Vocabulary size
- Frequency distribution patterns
- Type-token ratios

In [None]:
# Calculate average word length for each language
comparison_data = {}

for lang in df['language'].unique():
    all_tokens = []
    for tokens in df[df['language'] == lang]['tokens']:
        all_tokens.extend(tokens)
    
    # Word length statistics
    word_lengths = [len(word) for word in all_tokens]
    avg_length = np.mean(word_lengths)
    median_length = np.median(word_lengths)
    std_length = np.std(word_lengths)
    
    # Frequency distribution stats
    frequencies = list(language_frequencies[lang].values())
    
    comparison_data[lang] = {
        'avg_word_length': avg_length,
        'median_word_length': median_length,
        'std_word_length': std_length,
        'word_length_distribution': word_lengths,
        'frequency_distribution': frequencies
    }
    
    print(f"{lang}:")
    print(f"  Average word length: {avg_length:.2f} characters")
    print(f"  Median word length: {median_length:.1f} characters")
    print(f"  Std deviation: {std_length:.2f}\n")

In [None]:
# Visualize word length distributions
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()
colors = sns.color_palette("husl", 4)

for idx, lang in enumerate(sorted(df['language'].unique())):
    word_lengths = comparison_data[lang]['word_length_distribution']
    
    axes[idx].hist(word_lengths, bins=30, color=colors[idx], alpha=0.7, edgecolor='black')
    axes[idx].axvline(comparison_data[lang]['avg_word_length'], 
                      color='red', linestyle='--', linewidth=2, 
                      label=f"Mean: {comparison_data[lang]['avg_word_length']:.2f}")
    axes[idx].set_xlabel('Word Length (characters)', fontsize=12)
    axes[idx].set_ylabel('Frequency', fontsize=12)
    axes[idx].set_title(f'Word Length Distribution: {lang}', fontsize=14, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('word_length_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Compare vocabulary growth across languages
plt.figure(figsize=(12, 7))

for idx, lang in enumerate(sorted(df['language'].unique())):
    # Get all tokens for this language
    all_tokens = []
    for tokens in df[df['language'] == lang]['tokens']:
        all_tokens.extend(tokens)
    
    # Calculate cumulative vocabulary size
    vocab_growth = []
    seen_words = set()
    sample_points = np.linspace(0, len(all_tokens), 100, dtype=int)
    
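    prev_point = 0
    for point in sample_points:
        # Add only the tokens seen since the previous sample point
        seen_words.update(all_tokens[prev_point:point])
        prev_point = point
        vocab_growth.append(len(seen_words))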
    
    plt.plot(sample_points, vocab_growth, marker='o', linewidth=2, 
             label=lang, color=sns.color_palette("husl", 4)[idx])

plt.xlabel('Number of Tokens', fontsize=13)
plt.ylabel('Vocabulary Size (Unique Words)', fontsize=13)
plt.title('Vocabulary Growth Across Languages\n(Heaps\' Law)', 
          fontsize=15, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('vocabulary_growth.png', dpi=300, bbox_inches='tight')
plt.show()

print("Heaps' Law: Vocabulary size grows as V = K × N^β")
print("where N is text length and β is typically 0.4-0.6")
print("This shows diminishing returns in vocabulary growth")

## Summary Statistics: Comprehensive Language Comparison

Let's compile all our findings into a comprehensive comparison table showing the statistical properties of each language.

In [None]:
# Create comprehensive summary table
summary_data = []

for lang in sorted(df['language'].unique()):
    ld = lexical_diversity[lang]
    cd = comparison_data[lang]
    
    # Calculate additional metrics
    freq_counter = language_frequencies[lang]
    
    # Top 10 words coverage (what % of text is covered by top 10 words)
    top10_freq = sum(count for word, count in freq_counter.most_common(10))
    top10_coverage = (top10_freq / ld['tokens']) * 100
    
    # Top 100 words coverage
    top100_freq = sum(count for word, count in freq_counter.most_common(100))
    top100_coverage = (top100_freq / ld['tokens']) * 100
    
    summary_data.append({
        'Language': lang,
        'Total Tokens': f"{ld['tokens']:,}",
        'Vocabulary Size': f"{ld['types']:,}",
        'Type-Token Ratio': f"{ld['ttr']:.4f}",
        'Avg Word Length': f"{cd['avg_word_length']:.2f}",
        'Hapax Legomena': f"{ld['hapax']:,} ({ld['hapax_pct']:.1f}%)",
        'Top 10 Coverage': f"{top10_coverage:.1f}%",
        'Top 100 Coverage': f"{top100_coverage:.1f}%"
    })

summary_df = pd.DataFrame(summary_data)
print("\n" + "="*100)
print("COMPREHENSIVE STATISTICAL LINGUISTICS SUMMARY")
print("="*100 + "\n")
print(summary_df.to_string(index=False))
print("\n" + "="*100)

In [None]:
# Create a visual summary dashboard
fig = plt.figure(figsize=(18, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

languages = sorted(df['language'].unique())
colors = sns.color_palette("husl", len(languages))

# Plot 1: Total tokens
ax1 = fig.add_subplot(gs[0, 0])
tokens = [lexical_diversity[lang]['tokens'] for lang in languages]
ax1.bar(languages, tokens, color=colors)
ax1.set_title('Total Tokens', fontweight='bold', fontsize=12)
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)

# Plot 2: Vocabulary size
ax2 = fig.add_subplot(gs[0, 1])
types = [lexical_diversity[lang]['types'] for lang in languages]
ax2.bar(languages, types, color=colors)
ax2.set_title('Vocabulary Size', fontweight='bold', fontsize=12)
ax2.set_ylabel('Unique Words')
ax2.tick_params(axis='x', rotation=45)

# Plot 3: Type-Token Ratio
ax3 = fig.add_subplot(gs[0, 2])
ttrs = [lexical_diversity[lang]['ttr'] for lang in languages]
ax3.bar(languages, ttrs, color=colors)
ax3.set_title('Type-Token Ratio', fontweight='bold', fontsize=12)
ax3.set_ylabel('TTR')
ax3.tick_params(axis='x', rotation=45)

# Plot 4: Average word length
ax4 = fig.add_subplot(gs[1, 0])
avg_lengths = [comparison_data[lang]['avg_word_length'] for lang in languages]
ax4.bar(languages, avg_lengths, color=colors)
ax4.set_title('Average Word Length', fontweight='bold', fontsize=12)
ax4.set_ylabel('Characters')
ax4.tick_params(axis='x', rotation=45)

# Plot 5: Hapax percentage
ax5 = fig.add_subplot(gs[1, 1])
hapax_pcts = [lexical_diversity[lang]['hapax_pct'] for lang in languages]
ax5.bar(languages, hapax_pcts, color=colors)
ax5.set_title('Hapax Legomena %', fontweight='bold', fontsize=12)
ax5.set_ylabel('Percentage')
ax5.tick_params(axis='x', rotation=45)

# Plot 6: Comparison radar chart
ax6 = fig.add_subplot(gs[1, 2], projection='polar')

# Normalize metrics for radar chart (0-1 scale)
metrics = ['TTR', 'Avg Length', 'Hapax %']
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]

for idx, lang in enumerate(languages):
    values = [
        lexical_diversity[lang]['ttr'] / max(ttrs),
        comparison_data[lang]['avg_word_length'] / max(avg_lengths),
        lexical_diversity[lang]['hapax_pct'] / max(hapax_pcts)
    ]
    values += values[:1]
    ax6.plot(angles, values, 'o-', linewidth=2, label=lang, color=colors[idx])
    ax6.fill(angles, values, alpha=0.15, color=colors[idx])

ax6.set_xticks(angles[:-1])
ax6.set_xticklabels(metrics)
ax6.set_ylim(0, 1)
ax6.set_title('Normalized Metrics', fontweight='bold', fontsize=12, pad=20)
ax6.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
ax6.grid(True)

fig.suptitle('Statistical Linguistics: Comprehensive Language Comparison', 
             fontsize=16, fontweight='bold', y=0.98)

plt.savefig('summary_dashboard.png', dpi=300, bbox_inches='tight')
plt.show()

## Key Findings and Conclusions

### Universal Patterns (Observed Across All Languages):

1. **Zipf's Law Holds**: All languages approximately follow a power law distribution (a roughly straight line on a log-log plot)
2. **High-Frequency Function Words**: Articles, prepositions, and pronouns dominate
3. **Hapax Legomena**: ~30-50% of vocabulary appears only once
4. **Vocabulary Growth**: Follows Heaps' Law (sublinear growth)
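
The sublinear growth in point 4 can be quantified by fitting Heaps' Law, V = K × N^β, with a least-squares line in log-log space. The sketch below is a minimal version under stated assumptions: `vocabulary_growth` and `fit_heaps` are hypothetical helpers, and the tokens are synthetic draws from a Zipf-like distribution rather than the notebook's data.

```python
import numpy as np

def vocabulary_growth(tokens):
    """Return (N, V) arrays: vocabulary size V after each of the first N tokens."""
    seen = set()
    sizes = []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return np.arange(1, len(tokens) + 1), np.array(sizes)

def fit_heaps(n, v):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    beta, log_k = np.polyfit(np.log(n), np.log(v), 1)
    return np.exp(log_k), beta

# Synthetic tokens drawn from a Zipf-like distribution (illustrative data only)
rng = np.random.default_rng(0)
tokens = rng.zipf(1.5, size=20_000).tolist()
n, v = vocabulary_growth(tokens)
K, beta = fit_heaps(n, v)
print(f"K = {K:.2f}, beta = {beta:.2f}")
```

A fitted β below 1 indicates that each additional token is less likely than the last to introduce a new type. The same two helpers could be applied to each language's token list.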

### Language-Specific Observations:

- **German**: Longer average word length (compound words)
- **Japanese**: Requires different tokenization; with the character-based approach used here, its statistics describe characters rather than words
- **English/Spanish**: Similar TTR and frequency patterns
- **Morphology Effects**: Inflected languages may show different TTR

### Applications in NLP:

1. **Language Modeling**: Frequency distributions inform probability models
2. **Text Compression**: Skewed, Zipf-like frequencies let entropy coders (e.g., Huffman coding) assign shorter codes to frequent words
3. **Information Retrieval**: TF-IDF based on frequency principles
4. **Machine Translation**: N-gram patterns reveal translation units
5. **Authorship Attribution**: Lexical diversity as stylistic marker

### Further Exploration:

- **Pointwise Mutual Information (PMI)**: Measure word association strength
- **Perplexity**: Evaluate language model quality
- **Entropy**: Measure linguistic uncertainty
- **Syllable-based metrics**: For phonological analysis
- **Morphological analysis**: Study word formation patterns
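
As a starting point for the first item, the sketch below computes PMI for adjacent word pairs, using PMI(x, y) = log2(P(x, y) / (P(x) · P(y))). The helper `bigram_pmi`, the `min_count` threshold, and the toy text are assumptions made for illustration; probabilities are simple relative frequencies with no smoothing.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

toy_tokens = ("new york is big . new york is busy . "
              "the city is big . the city is new .").split()
scores = bigram_pmi(toy_tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs whose members rarely occur apart, such as "new york" in the toy text, score higher than pairs built from a very frequent word such as "is".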

In [None]:
# Final summary statistics
print("\n" + "="*100)
print("COMPUTATIONAL LINGUISTICS ANALYSIS COMPLETE")
print("="*100)
print("\nKey Statistical Concepts Demonstrated:")
print("  ✓ Zipf's Law - Power law distribution of word frequencies")
print("  ✓ Type-Token Ratio - Lexical diversity measurement")
print("  ✓ Hapax Legomena - Rare word analysis")
print("  ✓ N-gram Analysis - Phrase detection and collocation")
print("  ✓ Heaps' Law - Vocabulary growth patterns")
print("  ✓ Cross-linguistic Comparison - Universal vs specific patterns")
print("\nVisualizations Generated:")
print("  • Word frequency distributions")
print("  • Zipf's law log-log plots")
print("  • Lexical diversity metrics")
print("  • N-gram frequency charts")
print("  • Word length distributions")
print("  • Vocabulary growth curves")
print("  • Comprehensive summary dashboard")
print("\n" + "="*100)