# 2. Metric Computation (Core Analysis)

**Summary**: The main computational workhorse. Calculates dependency sizes, bastard statistics, VO/HI scores, and sentence disorder metrics using parallel processing.

**Key Steps**:
1. Load metadata and short CoNLL files.
2. Parallel Processing: Compute dependency size stats, bastard counts, and VO/HI scores for all languages.
3. (Optional) Compute sentence-level disorder statistics.
4. Filter results by minimum occurrence count.
5. Compute the geometric Mean Aggregate Length (MAL) per language.
6. Rank languages by bastard frequency and export examples.

**Inputs**:
- `2.17_short/` (Short CoNLL files)
- `data/metadata.pkl`

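**Outputs**:
- `data/all_langs_position2num.pkl` (Counts)
- `data/all_langs_position2sizes.pkl` (Sums of log sizes)
- `data/all_langs_average_sizes.pkl` (Geometric means)
- `data/filtered_position2num.pkl`, `data/filtered_position2sizes.pkl` (Positions with at least `MIN_COUNT` occurrences)
- `data/lang2MAL.pkl`
- `data/vo_vs_hi_scores.csv`
- `data/sentence_disorder_percentages.pkl`, `data/sentence_disorder_percentages.csv`
- `data/all_config_examples.pkl`
- `data/bastard_stats.csv`, `data/bastard_relations_per_lang.json`, `data/bastard_examples/`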

**Runtime**: ~2-5 minutes (parallelized)

---

In [1]:
import os
import pandas as pd
from tqdm import tqdm

# Import custom modules
import data_utils
import conll_processing
import analysis

## Configuration

In [2]:
DATA_DIR = "data"
MIN_COUNT = 10  # Minimum occurrence count for positions

## 1. Load Metadata

In [3]:
# Load metadata from notebook 01
metadata = data_utils.load_metadata(os.path.join(DATA_DIR, 'metadata.pkl'))

langShortConllFiles = metadata['langShortConllFiles']
langNames = metadata['langNames']
langnameGroup = metadata['langnameGroup']
appearance_dict = metadata['appearance_dict']
ud_version = metadata['ud_version']

print(f"UD version: {ud_version}")
print(f"Languages: {len(langShortConllFiles)}")
total_files = sum(len(files) for files in langShortConllFiles.values())
print(f"Total short files to process: {total_files}")

Loaded metadata from data/metadata.pkl
UD version: 2.17
Languages: 187
Total short files to process: 810


## 2. Process CoNLL Files in Parallel

This step extracts dependency sizes for all verbal constructions across all languages.
It uses multiprocessing to parallelize across CPU cores.

**Note**: This is the most computationally intensive step, and its runtime depends on how many CPU cores are available. On Calcul, it takes less than 40 seconds without bastards and about 1 minute 15 seconds with bastards.
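
As a rough illustration of the map-then-merge pattern described above, the sketch below counts dependency relations per file and merges the per-file counters. It is not the implementation of `conll_processing.get_all_stats_parallel`, which is not shown in this notebook; the helpers `count_deprels`, `merge_counts`, and `count_parallel` and the toy sentence are all hypothetical.

```python
from collections import Counter
from multiprocessing import Pool


def count_deprels(conllu_text):
    """Hypothetical worker: count dependency relations in one CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank separators and comment lines
        cols = line.split("\t")
        # Keep plain word lines only (skips multiword ranges such as "1-2")
        if len(cols) == 10 and cols[0].isdigit():
            counts[cols[7]] += 1  # 8th column holds DEPREL
    return counts


def merge_counts(partials):
    """Combine per-file counters into one global counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_parallel(texts, processes=None):
    """Map the worker over all inputs in a process pool, then merge."""
    with Pool(processes=processes) as pool:
        return merge_counts(pool.imap_unordered(count_deprels, texts))


# Invented two-token sentence, used twice to mimic two files
TOY = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

# Serial run of the same map-then-merge logic
print(merge_counts(map(count_deprels, [TOY, TOY])))
```

In a script, `count_parallel([TOY, TOY], processes=2)` would give the same counts; it should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.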

In [4]:
# Flatten all short files into a single list
allshortconll = []
for lang, files in langShortConllFiles.items():
    allshortconll.extend(files)

print(f"Processing {len(allshortconll)} files in parallel...")

Processing 810 files in parallel...


In [5]:
# Unified processing: Compute ALL metrics in one pass
import importlib
importlib.reload(conll_processing)

print("Starting unified analysis (Dependencies, Bastards, VO/HI, Disorder). this should take about a minute on 80 cores...")

# Allow configuring sentence disorder and configuration examples
compute_sentence_disorder = True
collect_config_examples = True  # Collect examples for HTML visualization

results = conll_processing.get_all_stats_parallel(
    allshortconll,
    include_bastards=True,
    compute_sentence_disorder=compute_sentence_disorder,
    collect_config_examples=collect_config_examples,
    max_examples_per_config=10
)

# Unpack results
if collect_config_examples:
    (all_langs_position2num, all_langs_position2sizes, all_langs_average_sizes, all_langs_average_charsizes, 
     lang_bastard_stats, global_bastard_relations, 
     lang_vo_hi_scores, 
     sentence_disorder_pct,
     all_config_examples) = results
    print(f"Collected configuration examples for {len(all_config_examples)} languages")
else:
    (all_langs_position2num, all_langs_position2sizes, all_langs_average_sizes, all_langs_average_charsizes, 
     lang_bastard_stats, global_bastard_relations, 
     lang_vo_hi_scores, 
     sentence_disorder_pct) = results
    all_config_examples = None

print("Processing complete!")
print(f"Results computed for {len(all_langs_position2num)} languages")

# Save VO/HI Scores immediately
vo_hi_rows = []
for lang, scores in lang_vo_hi_scores.items():
    row = scores.copy()
    row['language_code'] = lang
    row['language_name'] = langNames.get(lang, lang)
    row['group'] = langnameGroup.get(row['language_name'], 'Unknown')
    vo_hi_rows.append(row)

vo_hi_df = pd.DataFrame(vo_hi_rows)
vo_hi_output_file = os.path.join(DATA_DIR, 'vo_vs_hi_scores.csv')
vo_hi_df.to_csv(vo_hi_output_file, index=False)
print(f"Saved VO/HI scores to {vo_hi_output_file}")

# Save configuration examples if collected
if all_config_examples is not None:
    import pickle
    config_examples_path = os.path.join(DATA_DIR, 'all_config_examples.pkl')
    with open(config_examples_path, 'wb') as f:
        pickle.dump(all_config_examples, f)
    print(f"Saved configuration examples to {config_examples_path}")

# We already have bastard stats, so no need to re-run later
print("Bastard stats also computed.")

Starting unified analysis (Dependencies, Bastards, VO/HI, Disorder). this should take about a minute on 80 cores...
Starting unified processing on 80 cores


Processing files: 100%|██████████| 810/810 [01:06<00:00, 12.22it/s]


Finished processing. Combining results...
Done!
Collected configuration examples for 186 languages
Processing complete!
Results computed for 186 languages
Saved VO/HI scores to data/vo_vs_hi_scores.csv
Saved configuration examples to data/all_config_examples.pkl
Bastard stats also computed.


In [6]:
# Count languages per `vo_type` label (VO, OV, NDO)
print(f"VO: {len([l for l in lang_vo_hi_scores.values() if l.get('vo_type') == 'VO'])}")
print(f"OV: {len([l for l in lang_vo_hi_scores.values() if l.get('vo_type') == 'OV'])}")
print(f"NDO: {len([l for l in lang_vo_hi_scores.values() if l.get('vo_type') == 'NDO'])}")

VO: 93
OV: 69
NDO: 24


## 2b. Compute Sentence-Level Disorder (Optional)

This cell saves and reshapes the sentence-level disorder data:
* **Saving**: The unified processing block above computes `sentence_disorder_pct` but does not save it. This cell writes the raw dictionary to `sentence_disorder_percentages.pkl` and a flat table to `sentence_disorder_percentages.csv`.
* **Formatting**: It flattens the raw nested dictionary into one row per language, the table structure that the subsequent notebooks need.

Set `compute_sentence_disorder=True` in the processing cell above to enable this analysis.
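
To make the resulting column names concrete, here is a small sketch of the flattening on invented toy counts. It assumes the key structure `(side, tot, idx)` used in the cell below; the helper name `flatten_disorder` is hypothetical and the numbers are not real data.

```python
def flatten_disorder(stats):
    """Flatten {(side, tot, idx): counts} into one flat row dictionary."""
    row = {}
    for (side, tot, idx), counts in stats.items():
        if isinstance(counts, int):
            # Plain totals, e.g. ("left", 2, "total") -> "left_tot_2_total"
            row[f"{side}_tot_{tot}_{idx}"] = counts
            continue
        total = counts["lt"] + counts["eq"] + counts["gt"]
        prefix = f"{side}_tot_{tot}_pair_{idx}"
        for rel in ("lt", "eq", "gt"):
            row[f"{prefix}_{rel}"] = counts[rel]
            # Percentages default to 0.0 when a pair was never observed
            row[f"{prefix}_{rel}_pct"] = 100 * counts[rel] / total if total else 0.0
        row[f"{prefix}_total"] = total
    return row


# Invented counts for a single language
toy = {("left", 2, "total"): 4, ("left", 2, 0): {"lt": 1, "eq": 1, "gt": 2}}
row = flatten_disorder(toy)
print(row["left_tot_2_total"], row["left_tot_2_pair_0_gt_pct"])  # 4 50.0
```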

In [7]:
# Convert sentence disorder percentages to DataFrame and save
if sentence_disorder_pct is not None:
    import pickle
    
    # Convert to DataFrame format suitable for notebook 04
    sentence_disorder_rows = []
    for lang_code in sentence_disorder_pct:
        lang_name = langNames.get(lang_code, lang_code)
        group = langnameGroup.get(lang_name, 'Unknown')
        
        row = {
            'language_code': lang_code,
            'language_name': lang_name,
            'group': group
        }
        
        # Add columns for each configuration
        # Process granular ordering stats
        # Key is now (side, tot, pair_idx)
        for (side, tot, idx), counts in sentence_disorder_pct[lang_code].items():
            if isinstance(counts, int):
                # Handle integer total counts
                row[f'{side}_tot_{tot}_{idx}'] = counts
                continue
            
            total = counts['lt'] + counts['eq'] + counts['gt']
            
            prefix = f'{side}_tot_{tot}_pair_{idx}'
            row[f'{prefix}_lt'] = counts['lt']
            row[f'{prefix}_eq'] = counts['eq']
            row[f'{prefix}_gt'] = counts['gt']
            row[f'{prefix}_total'] = total
            
            if total > 0:
                row[f'{prefix}_lt_pct'] = counts['lt'] / total * 100
                row[f'{prefix}_eq_pct'] = counts['eq'] / total * 100
                row[f'{prefix}_gt_pct'] = counts['gt'] / total * 100
            else:
                row[f'{prefix}_lt_pct'] = 0
                row[f'{prefix}_eq_pct'] = 0
                row[f'{prefix}_gt_pct'] = 0
        
        sentence_disorder_rows.append(row)
    
    sentence_disorder_df = pd.DataFrame(sentence_disorder_rows)
    
    print(f"\nSentence-level disorder DataFrame:")
    print(f"  Shape: {sentence_disorder_df.shape}")
    # print(f"  Columns: {list(sentence_disorder_df.columns)}")
    print("\nSample:")
    print(sentence_disorder_df.head())
    
    # Save to pickle and CSV
    with open(os.path.join(DATA_DIR, 'sentence_disorder_percentages.pkl'), 'wb') as f:
        pickle.dump(sentence_disorder_pct, f)
    
    sentence_disorder_df.to_csv(os.path.join(DATA_DIR, 'sentence_disorder_percentages.csv'), index=False)
    
    print(f"\nSaved sentence disorder data to {DATA_DIR}/")
else:
    print("Sentence-level disorder not computed (compute_sentence_disorder=False)")
    sentence_disorder_df = None


Sentence-level disorder DataFrame:
  Shape: (186, 2647)

Sample:
  language_code language_name           group  left_tot_1_total  \
0           abq         Abaza       Caucasian               108   
1            ab        Abkhaz       Caucasian              1483   
2            af     Afrikaans   Indo-European              1391   
3           akk      Akkadian     Afroasiatic              1317   
4           aqz       Akuntsu  South-American               180   

   right_tot_1_total  left_tot_2_total  left_tot_2_pair_0_lt  \
0               35.0                48                    10   
1              354.0               742                   116   
2             1786.0              1362                   720   
3              455.0               978                   338   
4               82.0                52                     2   

   left_tot_2_pair_0_eq  left_tot_2_pair_0_gt  left_tot_2_pair_0_total  ...  \
0                    19                    19                      

## 3. Filter by Minimum Count

Remove positions that occur fewer than MIN_COUNT times to reduce noise.
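
The `analysis.filter_by_min_count` helper is not shown in this notebook. A minimal sketch of the filtering it describes, assuming it keeps a position only when its count reaches the threshold and drops the matching size entry otherwise, might look like this. The name `filter_by_min_count_sketch` and the toy values are hypothetical.

```python
def filter_by_min_count_sketch(position2num, position2sizes, min_count=10):
    """Keep only positions whose occurrence count is at least min_count."""
    kept_num, kept_sizes = {}, {}
    for lang, counts in position2num.items():
        # Positions frequent enough to keep for this language
        kept_num[lang] = {k: c for k, c in counts.items() if c >= min_count}
        # Keep the log-size sums for exactly the same positions
        sizes = position2sizes.get(lang, {})
        kept_sizes[lang] = {k: sizes[k] for k in kept_num[lang] if k in sizes}
    return kept_num, kept_sizes


# Invented counts and log-size sums for one language
num = {"xx": {"left_1": 12, "right_1": 3}}
sizes = {"xx": {"left_1": 8.3, "right_1": 2.1}}
kept_num, kept_sizes = filter_by_min_count_sketch(num, sizes, min_count=10)
print(kept_num)    # {'xx': {'left_1': 12}}
print(kept_sizes)  # {'xx': {'left_1': 8.3}}
```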

## 4. Compute Geometric Mean Aggregate Length (MAL)

For each language, compute MALₙ for n=1 to max available right dependents.

We use the **Geometric Mean** to be robust against outliers in constituent length distributions.

### MAL Formula

$$
\text{MAL}_n = \exp\left(\frac{\sum_{i=1}^{n} \text{position2sizes}[\text{right}\_i\text{\_totright}\_n]}{\sum_{i=1}^{n} \text{position2num}[\text{right}\_i\text{\_totright}\_n]}\right)
$$

Where `position2sizes` stores the sum of logarithms of constituent sizes.
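
The formula can be checked by hand on toy data. The sketch below is a hypothetical `mal_n` helper, not the `analysis.compute_MAL_per_language` implementation; returning `None` when a key is missing is an assumption made here for illustration.

```python
import math


def mal_n(position2sizes, position2num, n):
    """Geometric MAL_n for one language, following the formula above.

    Assumption: returns None when any right_i_totright_n key is missing
    (for example after filtering) or when the pooled count is zero.
    """
    keys = [f"right_{i}_totright_{n}" for i in range(1, n + 1)]
    if any(k not in position2num or k not in position2sizes for k in keys):
        return None
    log_sum = sum(position2sizes[k] for k in keys)  # pooled sum of log sizes
    count = sum(position2num[k] for k in keys)      # pooled number of dependents
    return math.exp(log_sum / count) if count else None


# Toy data (invented): two verbs, each with two right dependents
# of size 1 (first position) and size 4 (second position).
toy_num = {"right_1_totright_2": 2, "right_2_totright_2": 2}
toy_sizes = {
    "right_1_totright_2": 2 * math.log(1),
    "right_2_totright_2": 2 * math.log(4),
}
print(f"{mal_n(toy_sizes, toy_num, 2):.3f}")  # 2.000
```

The pooled sizes are 1, 4, 1, 4, whose geometric mean is 2, which matches the printed value.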

In [8]:
# Filter positions
filtered_position2num, filtered_position2sizes = analysis.filter_by_min_count(
    all_langs_position2num,
    all_langs_position2sizes,
    min_count=MIN_COUNT
)

# Count total positions before and after filtering
total_before = sum(len(positions) for positions in all_langs_position2num.values())
total_after = sum(len(positions) for positions in filtered_position2num.values())
print(f"Positions before filtering: {total_before}")
print(f"Positions after filtering (>= {MIN_COUNT}): {total_after}")
print(f"Removed: {total_before - total_after} ({100*(total_before - total_after)/total_before:.1f}%)")

Filtered to positions with at least 10 occurrences
Positions before filtering: 22866
Positions after filtering (>= 10): 12561
Removed: 10305 (45.1%)


## 5. Data Structure Documentation

### all_langs_position2num
Dictionary mapping each language code to a dictionary of position keys → occurrence counts.

**Structure**: `{lang_code: {position_key: count}}`

**Position keys**:
- `left_N`: N-th dependent to the left of verb (N=1 is closest)
- `right_N`: N-th dependent to the right of verb (N=1 is closest)
- `left_N_totleft_M`: N-th left dependent when there are M total left dependents
- `right_N_totright_M`: N-th right dependent when there are M total right dependents
- `average_totleft_M`: average size across all left dependents when M total
- `average_totright_M`: average size across all right dependents when M total

**Example**:
```python
all_langs_position2num['en'] = {
    'left_1': 12543,
    'right_1': 18732,
    'right_1_totright_2': 5234,
    ...
}
```
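
When working with these keys downstream, it can help to split them back into their parts. The sketch below is one hypothetical way to do so with a regular expression; `parse_position_key` is not part of the project modules, and `average_*` keys are deliberately left unparsed.

```python
import re

# side, position N, and optionally the total M on the same side
KEY_RE = re.compile(r"^(left|right)_(\d+)(?:_tot\1_(\d+))?$")


def parse_position_key(key):
    """Return side, position and total for a position key, or None if no match."""
    match = KEY_RE.match(key)
    if match is None:
        return None  # e.g. average_totleft_M keys or malformed keys
    side, n, total = match.groups()
    return {"side": side, "n": int(n), "total": int(total) if total else None}


print(parse_position_key("right_1_totright_2"))  # {'side': 'right', 'n': 1, 'total': 2}
print(parse_position_key("left_3"))              # {'side': 'left', 'n': 3, 'total': None}
print(parse_position_key("average_totleft_2"))   # None
```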

### all_langs_position2sizes
Dictionary mapping each language code to a dictionary of position keys → **total log size** (sum of logarithms of all dependency sizes).

**Structure**: `{lang_code: {position_key: total_log_size}}`

**Note**: We use the logarithm of sizes to compute geometric means, which are more appropriate for the highly skewed distribution of constituent sizes.

**Example**:
```python
all_langs_position2sizes['en'] = {
    'left_1': 8694.5,  # Sum of logs of sizes
    'right_1': 20576.2, 
    ...
}
```

### all_langs_average_sizes
Dictionary mapping each language code to a dictionary of position keys → **Geometric Mean** dependency size.

**Structure**: `{lang_code: {position_key: geometric_mean_size}}`

**Computation**: `geometric_mean = exp(total_log_size / count)`

**Example**:
```python
all_langs_average_sizes['en'] = {
    'left_1': 2.0,  # exp(8694.5 / 12543)
    'right_1': 3.0, 
    ...
}
```
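
As a quick sanity check of the computation above, the following sketch confirms on invented sizes that `exp(total_log_size / count)` equals the geometric mean and shows how it differs from the arithmetic mean on skewed data.

```python
import math
import statistics

sizes = [1, 2, 4, 8]  # invented dependency sizes

total_log_size = sum(math.log(s) for s in sizes)  # what position2sizes stores
count = len(sizes)                                # what position2num stores
geometric_mean = math.exp(total_log_size / count)

# Same value as the standard-library geometric mean
assert math.isclose(geometric_mean, statistics.geometric_mean(sizes))

print(round(geometric_mean, 3), statistics.mean(sizes))  # 2.828 3.75
```

The single large value (8) pulls the arithmetic mean up to 3.75, while the geometric mean stays at about 2.83.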

### lang2MAL
Dictionary mapping each language code to a dictionary of n → MALₙ values.

**Structure**: `{lang_code: {n: MAL_n}}`

**Computation**: For each n, compute the **Geometric Mean Aggregate Length** of right dependents from position 1 to n.

**Formula**:
$$
\text{MAL}_n = \exp\left(\frac{\sum_{i=1}^{n} \text{position2sizes}[\text{right}\_i\text{\_totright}\_n]}{\sum_{i=1}^{n} \text{position2num}[\text{right}\_i\text{\_totright}\_n]}\right)
$$

**Example**:
```python
lang2MAL['en'] = {
    1: 2.5,  # GM size when 1 right dependent
    2: 3.2,  # GM aggregate size when 2 right dependents
    3: 3.8,  # GM aggregate size when 3 right dependents
    ...
}
```

**Interpretation**: MALₙ typically increases with n, indicating that having more right dependents correlates with larger dependent sizes. Curves of MALₙ against n are often used to visualize Menzerath's Law and similar effects.

In [9]:
# Compute MAL for each language
lang2MAL = analysis.compute_MAL_per_language(
    filtered_position2sizes,
    filtered_position2num
)

print(f"MAL computed for {len(lang2MAL)} languages")

# Show sample
sample_lang = list(lang2MAL.keys())[0]
print(f"\nSample (language '{sample_lang}'):")
print(lang2MAL[sample_lang])

MAL computed for 186 languages

Sample (language 'abq'):
{1: 1.5134613006817135}


## Analysis of bastard frequency

In [10]:
# (Bastard analysis integrated into main pass)
pass

In [11]:
import json
# Bastard statistics (lang_bastard_stats, global_bastard_relations) were already
# computed in the unified pass above, so they are not recomputed here.

# Create a DataFrame for ranking
ranking_data = []
for lang, stats in lang_bastard_stats.items():
    verbs = stats['verbs']
    bastards = stats['bastards']
    percentage = (bastards / verbs * 100) if verbs > 0 else 0
    
    # Find most frequent relation
    relations = stats.get('relations', {})
    if relations:
        top_rel = max(relations, key=relations.get)
        top_rel_count = relations[top_rel]
        top_rel_str = f"{top_rel} ({top_rel_count})"
    else:
        top_rel_str = "None"

    ranking_data.append({
        'Code': lang,
        'Language': langNames.get(lang, lang),
        'Verbs': verbs,
        'Bastards': bastards,
        'Bastards_per_Verb_Pct': percentage,
        'Top_Bastard_Rel': top_rel_str
    })

df_ranking = pd.DataFrame(ranking_data)
df_ranking = df_ranking.sort_values('Bastards_per_Verb_Pct', ascending=False).reset_index(drop=True)

print("\nTop 20 Languages by Bastard Frequency (per Verb):")
print(df_ranking.head(20))

print("\nGlobal Bastard Relation Frequencies:")
sorted_relations = sorted(global_bastard_relations.items(), key=lambda x: x[1], reverse=True)
for rel, count in sorted_relations[:20]:
    print(f"{rel}: {count}")

# Export examples
examples_dir = os.path.join(DATA_DIR, 'bastard_examples')
os.makedirs(examples_dir, exist_ok=True)

print(f"\nExporting examples to {examples_dir}...")
count_exported = 0

for lang, stats in lang_bastard_stats.items():
    relations = stats.get('relations', {})
    examples = stats.get('examples', {})
    
    if relations and examples:
        # Get most frequent relation
        top_rel = max(relations, key=relations.get)
        
        if top_rel in examples:
            # Create file content
            content = f"# Language: {lang} ({langNames.get(lang, lang)})\n"
            content += f"# Most frequent bastard relation: {top_rel}\n"
            content += f"# Total bastards with this relation: {relations[top_rel]}\n\n"
            
            for i, tree_str in enumerate(examples[top_rel]):
                content += f"# Example {i+1}\n"
                content += tree_str + "\n"
            
            # Save to file
            filename = os.path.join(examples_dir, f"{lang}_{top_rel}_examples.conllu")
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(content)
            count_exported += 1

print(f"Exported example files for {count_exported} languages.")

# Export detailed bastard relation counts per language
relation_counts_export = {}
for lang, stats in lang_bastard_stats.items():
    relation_counts_export[lang] = stats.get('relations', {})

relation_export_path = os.path.join(DATA_DIR, 'bastard_relations_per_lang.json')
with open(relation_export_path, 'w', encoding='utf-8') as f:
    json.dump(relation_counts_export, f, indent=2)

print(f"Exported detailed bastard relations to {relation_export_path}")



Top 20 Languages by Bastard Frequency (per Verb):
   Code           Language   Verbs  Bastards  Bastards_per_Verb_Pct  \
0   grc       AncientGreek   80843     24326              30.090422   
1   xpg           Phrygian     234        58              24.786325   
2    la              Latin  149655     29636              19.802880   
3    ps             Pashto     327        63              19.266055   
4   pro       OldProvençal    5468       895              16.367959   
5   orv      OldEastSlavic   69895     11071              15.839473   
6    sa           Sanskrit   40150      6230              15.516812   
7    hu          Hungarian    3663       491              13.404313   
8   swl        SwedishSign     611        81              13.256956   
9   fro          OldFrench   37730      4609              12.215743   
10  gub          Guajajára    1060       129              12.169811   
11  hit            Hittite     213        25              11.737089   
12   nl              Dutch

In [12]:
# showing the bastard tables as df

pd.set_option('display.max_rows', 50)

print("Top Languages by Bastard Frequency (per Verb):")
display(df_ranking)

print("\nGlobal Bastard Relation Frequencies:")
df_global_relations = pd.DataFrame(
    sorted(global_bastard_relations.items(), key=lambda x: x[1], reverse=True),
    columns=['Relation', 'Count']
)
display(df_global_relations)

# Save bastard stats to CSV for further analysis (e.g. Notebook 05)
bastard_csv_path = os.path.join(DATA_DIR, 'bastard_stats.csv')
df_ranking.to_csv(bastard_csv_path, index=False)
print(f"Saved bastard statistics to {bastard_csv_path}")


Top Languages by Bastard Frequency (per Verb):


|  | Code | Language | Verbs | Bastards | Bastards_per_Verb_Pct | Top_Bastard_Rel |
|---|---|---|---|---|---|---|
| 0 | grc | AncientGreek | 80843 | 24326 | 30.090422 | nmod (4353) |
| 1 | xpg | Phrygian | 234 | 58 | 24.786325 | conj (18) |
| 2 | la | Latin | 149655 | 29636 | 19.802880 | acl (3835) |
| 3 | ps | Pashto | 327 | 63 | 19.266055 | acl (25) |
| 4 | pro | OldProvençal | 5468 | 895 | 16.367959 | obj (160) |
| ... | ... | ... | ... | ... | ... | ... |
| 181 | bor | Borôro | 27057 | 0 | 0.000000 |  |
| 182 | aii | Assyrian | 57 | 0 | 0.000000 |  |
| 183 | apu | Apurinã | 215 | 0 | 0.000000 |  |
| 184 | vep | Veps | 183 | 0 | 0.000000 |  |



Global Bastard Relation Frequencies:


|  | Relation | Count |
|---|---|---|
| 0 | acl | 27129 |
| 1 | conj | 21541 |
| 2 | obl | 20089 |
| 3 | obj | 19446 |
| 4 | nmod | 18131 |
| 5 | advcl | 10830 |
| 6 | advmod | 10611 |
| 7 | mark | 10499 |
| 8 | det | 9310 |
| 9 | cc | 9174 |


Saved bastard statistics to data/bastard_stats.csv


## 6. Export Analysis Results

In [13]:
# Save all analysis results
analysis.save_analysis_results(
    all_langs_position2num,
    all_langs_position2sizes,
    all_langs_average_sizes,
    filtered_position2num,
    filtered_position2sizes,
    lang2MAL,
    output_dir=DATA_DIR
)

print(f"Analysis results saved to {DATA_DIR}/")

Saved all_langs_position2num.pkl
Saved all_langs_position2sizes.pkl
Saved all_langs_average_sizes.pkl
Saved filtered_position2num.pkl
Saved filtered_position2sizes.pkl
Saved lang2MAL.pkl
All analysis results saved to data/
Analysis results saved to data/
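
Later notebooks need to read these files back. Assuming they are ordinary pickles (the body of `analysis.save_analysis_results` is not shown here), a minimal loading sketch could look like the following. The helper `load_pickle` is hypothetical, and the round trip uses a temporary file with invented content rather than the real `data/lang2MAL.pkl`.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load one pickled result file (only use on files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip on toy content in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lang2MAL.pkl")
    with open(path, "wb") as f:
        pickle.dump({"xx": {1: 1.5}}, f)
    restored = load_pickle(path)

print(restored["xx"][1])  # 1.5
```

With the real outputs, the call would be `load_pickle(os.path.join(DATA_DIR, "lang2MAL.pkl"))`.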


## Summary

This notebook has:
- ✅ Loaded metadata from notebook 01
- ✅ Processed all CoNLL files in parallel (dependency sizes, bastard statistics, VO/HI scores, sentence disorder)
- ✅ Filtered positions by minimum count (>= 10)
- ✅ Computed the geometric Mean Aggregate Length (MAL) for each language
- ✅ Analyzed bastard frequencies and exported examples
- ✅ Exported 6 analysis result files to data/

**Next step**: Run `03_visualization.ipynb` to create plots and explore results.