# 4. Factor & Table Computation

**Summary**: Computes derived metrics like Hierarchical Complexity Scores (HCS), growth factors, and generates the Verb-Centered Tables.

**Key Steps**:
1. Compute HCS factors (ratios between adjacent positions).
2. Generate Verb-Centered **Helix** Tables (incorporating growth factors).
3. Merge Head-Initiality data.

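**Inputs**:
- `data/metadata.pkl` (language names, groups, and CoNLL file lists)
- `data/all_langs_average_sizes.pkl` (regenerated by the extraction step in this notebook)
- `data/sentence_disorder_percentages.pkl` (regenerated by the same step; read back for ordering triples if available)
- `data/vo_vs_hi_scores.csv` (optional; used for order-based tables)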

**Outputs**:
- `data/hcs_factors.csv`
- `data/verb_centered_table.txt`
- `data/verb_centered_table_with_factors.tsv`

**Runtime**: a few minutes end to end (extraction ~1 minute on 80 cores; disorder plots ~2 minutes)

---

In [1]:
# !pip install openpyxl

In [2]:
import os
import pandas as pd
import numpy as np
import pickle
from importlib import reload

# Custom modules
import data_utils
import compute_factors
import verb_centered_analysis

# Reload to ensure latest changes are picked up
reload(compute_factors)
reload(verb_centered_analysis)

import psutil
import conll_processing
reload(conll_processing)


<module 'conll_processing' from '/bigstorage/kim/typometrics/dataanalysis/conll_processing.py'>

In [3]:
# Configuration
DATA_DIR = "data"
OUTPUT_DIR = "data"

## 1. Load Data

In [4]:
metadata = data_utils.load_metadata(os.path.join(DATA_DIR, 'metadata.pkl'))
langNames = metadata['langNames']
langnameGroup = metadata['langnameGroup']

print(f"Loaded metadata for {len(langNames)} languages")

Loaded metadata from data/metadata.pkl
Loaded metadata for 187 languages


In [5]:
# --------------------------------------------------------------------------------
# DATA EXTRACTION & COMPUTATION
# --------------------------------------------------------------------------------
# Instead of just loading pickles, we now run the full extraction pipeline.
# This ensures all statistics, including config examples, are fresh and consistent.

# 1. Flatten file list
print("Flattening file list...")
langShortConllFiles = metadata['langShortConllFiles']
allshortconll = []
for lang, files in langShortConllFiles.items():
    allshortconll.extend(files)

print(f"Processing {len(allshortconll)} files on {psutil.cpu_count()} cores... this takes about one minute")

# 2. Run Parallel Extraction
compute_sentence_disorder = True
collect_config_examples = True
max_examples_per_config = 25

results = conll_processing.get_all_stats_parallel(
    allshortconll,
    include_bastards=True,
    compute_sentence_disorder=compute_sentence_disorder,
    collect_config_examples=collect_config_examples,
    max_examples_per_config=max_examples_per_config
)

# 3. Unpack Results
print("Processing complete. Unpacking results...")
(all_langs_position2num, all_langs_position2sizes, all_langs_average_sizes, all_langs_average_charsizes,
 lang_bastard_stats, global_bastard_relations, 
 lang_vo_hi_scores, 
 sentence_disorder_pct,
 all_config_examples) = results

# 4. Save Results (for compatibility and persistence)
print("Saving results components...")

# Average Sizes
with open(os.path.join(DATA_DIR, 'all_langs_average_sizes.pkl'), 'wb') as f:
    pickle.dump(all_langs_average_sizes, f)

# Assign to variable expected by downstream cells
all_langs_average_sizes_filtered = all_langs_average_sizes

# Save filtered version (legacy support)
with open(os.path.join(DATA_DIR, 'all_langs_average_sizes_filtered.pkl'), 'wb') as f:
    pickle.dump(all_langs_average_sizes_filtered, f)

# Save Char Sizes
with open(os.path.join(DATA_DIR, 'all_langs_average_charsizes.pkl'), 'wb') as f:
    pickle.dump(all_langs_average_charsizes, f)

# Save Position Sizes
with open(os.path.join(DATA_DIR, 'all_langs_position2sizes.pkl'), 'wb') as f:
    pickle.dump(all_langs_position2sizes, f)

# Save Position Counts
with open(os.path.join(DATA_DIR, 'all_langs_position2num.pkl'), 'wb') as f:
    pickle.dump(all_langs_position2num, f)
    
# Save Disorder Stats (Ordering Stats)
with open(os.path.join(DATA_DIR, 'sentence_disorder_percentages.pkl'), 'wb') as f:
    pickle.dump(sentence_disorder_pct, f)
    
# Config Examples
if all_config_examples is not None:
    config_examples_path = os.path.join(DATA_DIR, 'all_config_examples.pkl')
    with open(config_examples_path, 'wb') as f:
        pickle.dump(all_config_examples, f)
    print(f"Saved configuration examples to {config_examples_path}")

print(f"Data extraction complete. Ready for analysis.")


Flattening file list...
Processing 810 files on 80 cores... this takes about one minute
Starting unified processing on 80 cores


Processing files: 100%|██████████| 810/810 [01:07<00:00, 12.05it/s]


Finished processing. Combining results...
Done!
Processing complete. Unpacking results...
Saving results components...
Saved configuration examples to data/all_config_examples.pkl
Data extraction complete. Ready for analysis.


## 2. Compute HCS Factors

In [6]:
hcs_df = compute_factors.compute_hcs_factors(
    all_langs_average_sizes_filtered, 
    langNames, 
    langnameGroup
)

print(f"Computed HCS factors for {len(hcs_df)} languages")
print(hcs_df.head())

Computed HCS factors for 171 languages
    language_code language_name           group  right_1_totright_2  \
81             ko        Korean           Other            1.824906   
117           pad       Paumarí  South-American            2.139826   
147            tn        Tswana     Niger-Congo            2.213364   
138           ssp   SpanishSign   Indo-European            1.525921   
71             ja      Japanese           Other            1.000000   

     right_2_totright_2  hcs_factor  
81             1.031262    0.565104  
117            1.414214    0.660901  
147            1.778279    0.803428  
138            1.389013    0.910278  
71             1.000000    1.000000  


In [7]:
hcs_path = os.path.join(OUTPUT_DIR, 'hcs_factors.csv')
hcs_df.to_csv(hcs_path, index=False)
print(f"Saved HCS factors to {hcs_path}")

Saved HCS factors to data/hcs_factors.csv
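
The columns printed above suggest that `hcs_factor` is the size of the second right dependent divided by the size of the first, for verbs with exactly two right dependents (for Korean, 1.031262 / 1.824906 ≈ 0.565). The sketch below reproduces that ratio. The definition is inferred from the output, not taken from `compute_factors`, and the helper name `hcs_ratio` is hypothetical.

```python
def hcs_ratio(sizes):
    """Return R2/R1 for two-right-dependent constructions (assumed definition)."""
    r1 = sizes.get("right_1_totright_2")
    r2 = sizes.get("right_2_totright_2")
    if r1 is None or r2 is None or r1 == 0:
        return None  # not enough data for this language
    return r2 / r1

# Values copied from the hcs_df.head() output above
korean = {"right_1_totright_2": 1.824906, "right_2_totright_2": 1.031262}
paumari = {"right_1_totright_2": 2.139826, "right_2_totright_2": 1.414214}

print(f"ko:  {hcs_ratio(korean):.4f}")
print(f"pad: {hcs_ratio(paumari):.4f}")
```

A value below 1 means the second dependent is on average shorter than the first; Japanese, with both sizes at 1.0, sits exactly at 1.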


## 3. Helix tables (Verb-Centered Constituent Size Analysis)

The Verb-Centered Constituent Size Analysis produces a helix table: for each construction, it lists the average constituent size X of every dependent to the right and to the left of the verb:
      VXXXX
      VXXX
      VXX
      VX
     XV
    XXV
   XXXV
  XXXXV

This table should come in several variants:

1. **Simple**: Only the average constituent size X for each construction of dependents to the right and the left of the verb
2. **Horizontal Growth (left to right)**: Between two adjacent Xs, the growth factor going from left to right
3. **Horizontal Growth (right to left)**: Between two adjacent Xs, the growth factor going from right to left
4. **Diagonal Growth (up right)**, added as an extra line: The growth factor from the last X of one construction to the last X of the next larger one, such as from the last X of VXXX to the last X of VXXXX
5. **Diagonal Growth (down left)**: The same as 4, but going down left
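
Variants 2 to 5 come down to ratios between pairs of cells. The sketch below shows the right-hand side only. It assumes plain ratios and the `right_{pos}_totright_{tot}` key naming visible elsewhere in this notebook; the helper names and the numbers are illustrative, not corpus results, and `verb_centered_analysis` may compute the factors differently.

```python
def horizontal_factors(sizes, tot):
    """Growth factors between adjacent Xs within one construction (e.g. VXXX)."""
    row = [sizes[f"right_{pos}_totright_{tot}"] for pos in range(1, tot + 1)]
    return [b / a for a, b in zip(row, row[1:])]

def diagonal_factor(sizes, tot):
    """Growth of the last X when moving from tot-1 to tot right dependents."""
    prev_last = sizes[f"right_{tot - 1}_totright_{tot - 1}"]
    this_last = sizes[f"right_{tot}_totright_{tot}"]
    return this_last / prev_last

# Illustrative sizes only
sizes = {
    "right_1_totright_1": 3.0,
    "right_1_totright_2": 2.0,
    "right_2_totright_2": 4.0,
    "right_1_totright_3": 2.0,
    "right_2_totright_3": 3.0,
    "right_3_totright_3": 6.0,
}

print(horizontal_factors(sizes, 3))  # [1.5, 2.0]
print(diagonal_factor(sizes, 3))     # 1.5
```

Under this assumption, the right-to-left and down-left variants are simply the reciprocals of these values.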

### AnyOtherSide Tables

In addition to exact configurations (e.g., VXX = 0 left, 2 right), we also generate **AnyOtherSide** tables that show patterns where one direction is ignored:

- **Any-Left patterns** (`... V X X`): Any number of left dependents, N right dependents
  - Example: `VXX_anyleft` matches verbs with exactly 2 right dependents, regardless of left side complexity
  - Tot dimension: Each row represents verbs with exactly N right dependents (tot=N on right), enabling diagonal factor computation
- **Any-Right patterns** (`X X V ...`): N left dependents, any number of right dependents  
  - Example: `XXV_anyright` matches verbs with exactly 2 left dependents, regardless of right side complexity
  - Tot dimension: Each row represents verbs with exactly N left dependents (tot=N on left), enabling diagonal factor computation
- **Any-Both pattern** (`... X V X ...`): Bilateral with any total counts
  - Example: `XVX_anyboth` matches verbs with 1 dependent on each side, any totals

These tables include both **horizontal growth factors** (comparing adjacent positions, e.g., R1→R2) and **diagonal growth factors** (comparing positions across tot levels, e.g., R2 at tot=2 vs R1 at tot=1). They are generated in the same TSV/XLSX formats as standard Helix tables.
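
As a rough illustration of how one verb can feed several tables at once, the sketch below maps a pair of dependent counts to its exact label and to the any-left / any-right labels described above. The function name is hypothetical, and the `_anyboth` case is left out because its exact matching rule is not spelled out here.

```python
def matching_patterns(n_left, n_right):
    """Return the exact label plus the partial labels this verb falls under."""
    exact = "X" * n_left + "V" + "X" * n_right
    labels = [exact]
    if n_right > 0:
        # Right side fixed, left side ignored
        labels.append("V" + "X" * n_right + "_anyleft")
    if n_left > 0:
        # Left side fixed, right side ignored
        labels.append("X" * n_left + "V" + "_anyright")
    return labels

print(matching_patterns(0, 2))  # ['VXX', 'VXX_anyleft']
print(matching_patterns(3, 2))  # ['XXXVXX', 'VXX_anyleft', 'XXXV_anyright']
```

A verb with three left and two right dependents therefore contributes to the `VXX_anyleft` row even though it never appears in the exact `VXX` row.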

In [8]:
# Verb-Centered Constituent Size Analysis
position_averages = verb_centered_analysis.compute_average_sizes_table(all_langs_average_sizes_filtered)
# Table saved in mass generation step below

  return compute_sizes_table(all_langs_average_sizes_filtered, table_type='standard')


## 4. Generate Helix Tables (Mass Generation)

Generate helix tables for all languages, families, and order types.

This includes:
- **Standard Helix Tables**: Exact configurations (VXX, XXV, etc.) with growth factors
- **AnyOtherSide Tables**: Partial configurations ignoring one direction (... V X X, X X V ..., etc.)
- **Global, Individual, Family-based, and Order-based** variants

Tables are generated in both TSV and XLSX formats.

In [9]:
# ## OPTIONAL: Compute Disorder Percentages
# ## Uncomment to compute disorder statistics

# import compute_disorder
# from importlib import reload
# reload(compute_disorder)

# # Compute disorder statistics with granular stats if available
# import pickle

# # Load granular ordering stats
# ordering_stats = {}
# disorder_pct_path = 'data/sentence_disorder_percentages.pkl'
# if os.path.exists(disorder_pct_path):
#     with open(disorder_pct_path, 'rb') as f:
#         ordering_stats = pickle.load(f)
#     print(f"Loaded granular ordering stats for {len(ordering_stats)} languages")

# disorder_df, disorder_percentages = compute_disorder.compute_disorder_statistics(
#     all_langs_average_sizes_filtered,
#     langNames,
#     langnameGroup,
#     ordering_stats=ordering_stats
# )
# print(f"Computed disorder for {len(disorder_df)} languages")
# print("\nDisorder percentages by configuration:")
# for (side, tot), pct in sorted(disorder_percentages.items()):
#     if pct is not None:
#         print(f"  {side} tot={tot}: {pct:.1f}% disordered")

# print("\n", disorder_df.head())

In [10]:
import pandas as pd
import pickle
from importlib import reload
import verb_centered_analysis

# Reload the module to use the latest table formatting logic
reload(verb_centered_analysis)

OUTPUT_TABLE_DIR = os.path.join(DATA_DIR, 'tables')

print("--- Starting Mass Table Generation ---")

# 1. Load Average Sizes (should already be loaded from previous cells)
if 'all_langs_average_sizes_filtered' not in locals():
    avg_path = os.path.join(DATA_DIR, 'all_langs_average_sizes_filtered.pkl')
    if os.path.exists(avg_path):
        with open(avg_path, 'rb') as f:
            all_langs_average_sizes_filtered = pickle.load(f)
        print("Loaded average sizes from disk.")
    else:
        # Fail fast: table generation below cannot run without this data
        raise FileNotFoundError(
            "'all_langs_average_sizes_filtered' not found. Run previous cells first."
        )

# 2. Load Ordering Statistics (optional - for ordering triples)
ordering_stats = {}
disorder_path = os.path.join(DATA_DIR, 'sentence_disorder_percentages.pkl')

if os.path.exists(disorder_path):
    with open(disorder_path, 'rb') as f:
        loaded_data = pickle.load(f)
        
    # Check if the data is in the new format (triplet counts)
    sample_lang = next(iter(loaded_data))
    sample_keys = list(loaded_data[sample_lang].keys()) if loaded_data[sample_lang] else []
    
    if sample_keys and len(sample_keys[0]) == 3:
        print("Loaded Ordering Stats (Triplets) successfully.")
        ordering_stats = loaded_data
    else:
        print("WARNING: Loaded data seems to be in OLD format. Triples cannot be shown.")
else:
    print(f"WARNING: {disorder_path} not found. Tables will be generated without triples.")

# 3. Load VO/OV Data (optional - for order-based tables)
vo_data = {}
vo_path = os.path.join(DATA_DIR, 'vo_vs_hi_scores.csv')
if os.path.exists(vo_path):
    vo_df = pd.read_csv(vo_path)
    for _, row in vo_df.iterrows():
        vo_data[row['language_code']] = row.to_dict()
    print("Loaded VO/OV classifications.")

# 4. Generate Tables AND Extract Disorder Metrics
# This will generate: Global, Individual, Family-based, and Order-based tables.
# Also extracts disorder extreme aggregate percentages for each language
disorder_df = verb_centered_analysis.generate_mass_tables(
    all_langs_average_sizes_filtered,
    ordering_stats,
    metadata,  # Loaded in cell 4 above
    vo_data=vo_data,
    output_dir=OUTPUT_TABLE_DIR,
    arrow_direction='left_to_right',
    extract_disorder_metrics=True
)

print(f"--- Completed. Tables saved to {OUTPUT_TABLE_DIR}/ ---")

# Save disorder metrics if extracted
if disorder_df is not None and len(disorder_df) > 0:
    disorder_csv_path = os.path.join(DATA_DIR, 'disorder_extreme_aggregates.csv')
    disorder_df.to_csv(disorder_csv_path, index=False)
    print(f"\nExtracted and saved disorder metrics for {len(disorder_df)} languages to {disorder_csv_path}")
    print("\nSample disorder metrics:")
    print(disorder_df.head(10))


--- Starting Mass Table Generation ---
Loaded Ordering Stats (Triplets) successfully.
Loaded VO/OV classifications.
Generating Global Table...



Generating Family Tables (10 families)...
Generating Individual Language Tables...
Calculating disorder metrics...



Generating Any-Other-Side Tables...
  Processing 186 languages...


  position_data = compute_sizes_table(single_lang_data, table_type='anyotherside')

  Generated Any-Other-Side tables for 186 languages
--- Completed. Tables saved to data/tables/ ---

Extracted and saved disorder metrics for 186 languages to data/disorder_extreme_aggregates.csv

Sample disorder metrics:
  language_code  language_name           group  left_tot_2_disordered  \
0           abq          Abaza       Caucasian               0.604167   
1            ab         Abkhaz       Caucasian               0.629380   
2            af      Afrikaans   Indo-European               0.765786   
3           akk       Akkadian     Afroasiatic               0.511247   
4           aqz        Akuntsu  South-American               0.884615   
5            sq       Albanian   Indo-European               0.637168   
6           gsw    SwissGerman   Indo-European               0.819608   
7            am        Amharic     Afroasiatic               0.422430   
8           grc   AncientGreek   Indo-European               0.810120   
9           hbo  AncientHebrew     Afroasiatic  




## 5. Configuration Example Creation

Generate interactive HTML visualizations of verb configurations from examples collected during data extraction.

This includes both **exact configurations** (VXX, XXV, etc.) and **partial configurations** (VXX_anyleft, XXV_anyright, XVX_anyboth):

- **Exact configs**: Match specific left and right dependent counts (e.g., VXX = 0 left, 2 right)
- **Partial configs**: Match one side exactly while ignoring the other side
  - `VXX_anyleft`: 2 right dependents, any number of left dependents (shows only right side in HTML)
  - `XXV_anyright`: 2 left dependents, any number of right dependents (shows only left side in HTML)
  - `XVX_anyboth`: 1 dependent on each side, any total counts (shows both sides)

**Note**: Examples are automatically collected by `run_data_extraction.py` using the same constraints as constituent size computation (same dependency relations, bastard inclusion, etc.). This ensures consistency and avoids duplicate CoNLL file parsing.
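
The `max_examples_per_config` setting used during extraction caps how many examples are stored per configuration. A minimal sketch of such a cap, assuming a first-come policy: the function name and the `(n_left, n_right, sentence_id)` tuples are hypothetical, and the real pipeline may select examples differently.

```python
from collections import defaultdict

def collect_examples(verbs, max_examples_per_config=25):
    """Keep at most max_examples_per_config sentence IDs per configuration."""
    examples = defaultdict(list)
    for n_left, n_right, sentence_id in verbs:
        config = "X" * n_left + "V" + "X" * n_right
        if len(examples[config]) < max_examples_per_config:
            examples[config].append(sentence_id)
    return dict(examples)

# Hypothetical input: (left dependents, right dependents, sentence ID)
verbs = [(0, 2, "s1"), (0, 2, "s2"), (1, 0, "s3"), (0, 2, "s4")]
print(collect_examples(verbs, max_examples_per_config=2))
# {'VXX': ['s1', 's2'], 'XV': ['s3']}
```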

In [11]:
# Generate HTML visualizations from configuration examples collected during data extraction
# Examples are automatically collected in run_data_extraction.py using the same constraints
# as constituent size computation (same dependency relations, bastard inclusion, etc.)

import generate_html_examples
from importlib import reload
reload(generate_html_examples)

print("Generating HTML visualizations from saved configuration examples...")
print("="*60)

# Check if examples have been collected
examples_path = os.path.join(DATA_DIR, 'all_config_examples.pkl')
if not os.path.exists(examples_path):
    print(f"ERROR: Configuration examples not found at {examples_path}")
    print("\nPlease run data extraction first:")
    print("  python3 run_data_extraction.py")
    print("\nThis will collect examples during the main processing pipeline.")
else:
    # Generate HTML from saved examples
    output_dir = 'html_examples'
    generate_html_examples.generate_all_html(
        data_dir=DATA_DIR,
        output_dir=output_dir
    )
    
    print(f"\n{'='*60}")
    print(f"Configuration Examples Generated Successfully!")
    print(f"{'='*60}")
    print(f"Output directory: {output_dir}/")
    print(f"Open {output_dir}/index.html to browse examples")
    print(f"\nFeatures:")
    print(f"  • Interactive dependency trees with reactive-dep-tree")
    print(f"  • Verbs highlighted in red, dependents in green")
    print(f"  • Organized by language with 3-column layout")
    print(f"  • Same constraints as constituent size computation")
    print(f"{'='*60}")

Generating HTML visualizations from saved configuration examples...
Loading configuration examples...
Loading metadata...
Loading position counts...
Loaded position counts for 186 languages
Loading average sizes (helix stats)...
Loaded average sizes for 186 languages
Generating HTML for 186 languages...


Generating HTML files: 100%|██████████| 186/186 [00:12<00:00, 15.38it/s]


Generating index...
Generated index at html_examples/index.html
Done! HTML files saved to html_examples/

Configuration Examples Generated Successfully!
Output directory: html_examples/
Open html_examples/index.html to browse examples

Features:
  • Interactive dependency trees with reactive-dep-tree
  • Verbs highlighted in red, dependents in green
  • Organized by language with 3-column layout
  • Same constraints as constituent size computation


In [12]:
# Run extraction for Rhapsodie Treebank (Custom Extraction)
# This uses the dedicated extraction script to process Rhapsodie files directly

import extract_treebank_configs
from importlib import reload
reload(extract_treebank_configs)

# Run extraction for Rhapsodie
extract_treebank_configs.main_func(
    input_dir="/bigstorage/kim/typometrics/dataanalysis/ud-treebanks-v2.17/UD_French-Rhapsodie/",
    output_dir="html_examples",
    treebank_name="French_Rhapsodie_UD",
    max_examples=1_000_000
)

Found 3 files in /bigstorage/kim/typometrics/dataanalysis/ud-treebanks-v2.17/UD_French-Rhapsodie/
Processing with MAX_EXAMPLES=1000000


Processing files: 100%|██████████| 3/3 [00:00<00:00,  3.26it/s]


Processing complete. Generating HTML...
Done! Generated 47 configuration files with 11261 total examples.
Output directory: /bigstorage/kim/typometrics/dataanalysis/html_examples/French_Rhapsodie_UD


In [13]:
# Test updated table with all fixes
import importlib
import sys

# Reload all verb-centered modules
for module_name in (
    'verb_centered_model',
    'verb_centered_computations',
    'verb_centered_layout',
    'verb_centered_builder',
    'verb_centered_formatters',
    'verb_centered_analysis',
):
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])

from verb_centered_analysis import create_verb_centered_table
from verb_centered_model import TableConfig
from verb_centered_formatters import TextTableFormatter

config_debug = TableConfig(
    show_horizontal_factors=True,
    show_diagonal_factors=True,
    show_ordering_triples=True,
    show_row_averages=True,
    show_marginal_means=True,
    arrow_direction='left_to_right'
)

# For testing, we need language-specific data.
# Use English if available; change test_lang to try another language code.
test_lang = 'en'
english_position_avg = all_langs_average_sizes_filtered.get(test_lang)
english_ordering = None

if english_position_avg is not None:
    english_ordering = ordering_stats.get(test_lang, {})
    print(f"Using language: {test_lang}")

# Fallback to global if no English found
if english_position_avg is None:
    print("No English data found, using global averages (no ordering triples)")
    english_position_avg = position_averages
    english_ordering = None

# Create table structure
table_struct = create_verb_centered_table(
    position_averages=english_position_avg,
    ordering_stats=english_ordering,
    hcs_row=None,
    config=config_debug,
    output_format='struct'
)

# Format as text
formatter = TextTableFormatter(table_struct)
table_txt = formatter.format()

# Print only lower half to check
lines = table_txt.split('\n')
print('\n'.join(lines[10:]))  # Skip upper half to focus on lower half

Using language: en
Diag R2-1                                                                                                                     ×1.11↗                                                                                 
R tot=1                                                                                                  V        3.349                                                                                  [GM: 3.349 | N=72090]
------------------------------------------------------------------------------------------------------------------------
X V X                                                                                       1.258   ×2.80→ (<74=20>6)   3.519                                                                                           
------------------------------------------------------------------------------------------------------------------------
L tot=1                                                                              

In [14]:
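# Check ordering_stats for English
# NOTE: ordering_stats appears to be keyed by language code (e.g. 'en'), not by
# treebank ID, so the 'en_ewt' lookup below misses, as the output shows.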
if 'en_ewt' in ordering_stats:
    print("English ordering stats found:")
    for key, value in sorted(ordering_stats['en_ewt'].items()):
        print(f"  {key}: {value}")
else:
    print("No English ordering stats found")
    print(f"Available languages: {list(ordering_stats.keys())[:5]}")

No English ordering stats found
Available languages: ['abq', 'ab', 'af', 'akk', 'aqz']


In [15]:
# Check what position_averages contains
print("position_averages type:", type(position_averages))
if isinstance(position_averages, dict):
    print("First few keys:", list(position_averages.keys())[:10])

position_averages type: <class 'dict'>
First few keys: ['left_1', 'left_1_totleft_1', 'left_1_anyother', 'left_1_anyother_totleft_1', 'average_totleft_1', 'left_1_totleft_2', 'left_1_anyother_totleft_2', 'left_2', 'left_2_totleft_2', 'left_2_anyother']


In [16]:
# Print full table to see both aggregate rows
print(table_txt)

Row                         L4                    L3                    L2                    L1         V          R1                    R2                    R3                    R4                 [GM | N | Slope]
M Vert Right                                                                                                      2.032                 2.765                 3.698                 5.257                         
M Diag Right                                                                                                                             1.99       ×1.33↗        2.33       ×1.31↗        4.17       ×1.16↗               
Agg R Last→                                                                                                                                                                                              [(<68.4=16.4>15.2) N=33542]
------------------------------------------------------------------------------------------------------------------------
R

In [17]:
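# Disorder metrics already extracted in previous cell during mass table generation
# Check if disorder_df exists and display summary
# NOTE: the printed range (0.0 - 1.0) suggests these columns hold fractions rather
# than percentages, so the '%' suffix is likely misleading; ':.1%' would rescale them.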
if 'disorder_df' in locals() and disorder_df is not None:
    print(f"Disorder metrics available for {len(disorder_df)} languages")
    print(f"Right extreme disorder range: {disorder_df['right_extreme_disorder'].min():.1f}% - {disorder_df['right_extreme_disorder'].max():.1f}%")
    print(f"Left extreme disorder range: {disorder_df['left_extreme_disorder'].min():.1f}% - {disorder_df['left_extreme_disorder'].max():.1f}%")
else:
    print("No disorder metrics available. Run the mass table generation cell first.")

Disorder metrics available for 186 languages
Right extreme disorder range: 0.0% - 1.0%
Left extreme disorder range: 0.3% - 1.0%


In [18]:
## Plot All Disorder Metrics and Diagonal Factors
# Create comprehensive scatter plots comparing:
# - Disorder percentages vs VO scores
# - Diagonal growth factors vs VO scores  
# - Diagonal factors vs disorder percentages (right and left)

import plotting
from importlib import reload
reload(plotting)

if 'disorder_df' in locals() and disorder_df is not None:
    print("Generating all disorder and diagonal factor plots. This takes a staggering 2 minutes 10...")
    
    saved_plots = plotting.plot_disorder_metrics_vs_vo(
        disorder_df=disorder_df,
        langnameGroup=langnameGroup,
        appearance_dict=metadata['appearance_dict'],
        data_dir=DATA_DIR,
        plots_dir='plots'
    )
    
    if saved_plots:
        print(f"\n{'='*60}")
        print(f"Successfully created {len(saved_plots)} plots:")
        for plot_path in saved_plots:
            print(f"  ✓ {plot_path}")
        print(f"{'='*60}")
    else:
        print("No plots were created. Check error messages above.")
else:
    print("Cannot create plots: disorder_df not available. Run the mass table generation cell first.")

Generating all disorder and diagonal factor plots. This takes a staggering 2 minutes 10...
Merged data: 186 languages with disorder and VO data
Generating 6 plots in parallel using 6 workers...
  ✓ Plot 1: 121 languages -> right_extreme_disorder_vs_vo.png
  ✓ Plot 2: 160 languages -> left_extreme_disorder_vs_vo.png
  ✓ Plot 3: 120 languages -> right_extreme_diag_factor_vs_vo.png
  ✓ Plot 4: 131 languages -> left_extreme_diag_factor_vs_vo.png
  ✓ Plot 5: 120 languages -> right_diag_factor_vs_disorder.png
  ✓ Plot 6: 131 languages -> left_diag_factor_vs_disorder.png

Successfully created 6 plots:
  ✓ plots/right_extreme_disorder_vs_vo.png
  ✓ plots/left_extreme_disorder_vs_vo.png
  ✓ plots/right_extreme_diag_factor_vs_vo.png
  ✓ plots/left_extreme_diag_factor_vs_vo.png
  ✓ plots/right_diag_factor_vs_disorder.png
  ✓ plots/left_diag_factor_vs_disorder.png
