## Imports

In [1]:
import os
import pandas as pd
from tqdm import tqdm

# Import custom modules
import data_utils
import conll_processing
import analysis

## Configuration

In [2]:
DATA_DIR = "data"
MIN_COUNT = 10  # Minimum occurrence count for positions

## 1. Load Metadata

In [3]:
# Load metadata from notebook 01
metadata = data_utils.load_metadata(os.path.join(DATA_DIR, 'metadata.pkl'))

langShortConllFiles = metadata['langShortConllFiles']
langNames = metadata['langNames']
langnameGroup = metadata['langnameGroup']
appearance_dict = metadata['appearance_dict']
ud_version = metadata['ud_version']

print(f"UD version: {ud_version}")
print(f"Languages: {len(langShortConllFiles)}")
total_files = sum(len(files) for files in langShortConllFiles.values())
print(f"Total short files to process: {total_files}")

Loaded metadata from data/metadata.pkl
UD version: 2.17
Languages: 186
Total short files to process: 814
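
The loading step can be pictured with the following minimal sketch. It assumes the metadata is a plain pickled dictionary; `load_metadata_sketch` and `EXPECTED_KEYS` are hypothetical names, not the actual `data_utils` implementation, and the toy values are placeholders rather than real UD metadata.

```python
import os
import pickle
import tempfile

# Keys this notebook reads from the metadata dictionary.
EXPECTED_KEYS = {
    "langShortConllFiles", "langNames", "langnameGroup",
    "appearance_dict", "ud_version",
}

def load_metadata_sketch(path):
    """Hypothetical stand-in for data_utils.load_metadata: unpickle and validate."""
    with open(path, "rb") as f:
        metadata = pickle.load(f)
    missing = EXPECTED_KEYS - set(metadata)
    if missing:
        raise KeyError(f"metadata is missing keys: {sorted(missing)}")
    return metadata

# Round trip with toy values (placeholders, not the real metadata).
toy = {
    "langShortConllFiles": {"en": ["en_a.conllu", "en_b.conllu"], "fr": ["fr_a.conllu"]},
    "langNames": {},
    "langnameGroup": {},
    "appearance_dict": {},
    "ud_version": "2.17",
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    loaded = load_metadata_sketch(path)

total_files = sum(len(files) for files in loaded["langShortConllFiles"].values())
print(f"Total short files to process: {total_files}")
```

Validating the keys up front makes a stale or partial `metadata.pkl` fail here rather than in a later cell.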


## 2. Process CoNLL Files in Parallel

This step extracts dependency sizes for all verbal constructions across all languages.
It uses multiprocessing to parallelize across CPU cores.

**Note**: This is the most computationally intensive step of the notebook, and its runtime depends on the number of available CPU cores. On Calcul (80 cores), it takes less than 40 seconds.
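
The map-then-combine pattern behind this step can be sketched as follows. This is an illustration only: `count_one_file` and `combine` are hypothetical names, and each task carries in-memory `(position_key, size)` records instead of parsing a real CoNLL file, so it does not reproduce `conll_processing.get_type_freq_all_files_parallel`.

```python
from collections import Counter, defaultdict
from multiprocessing import Pool

def count_one_file(task):
    """Hypothetical worker: tally counts and total sizes for one file's records."""
    lang, records = task  # records: list of (position_key, size) pairs
    num, sizes = Counter(), Counter()
    for key, size in records:
        num[key] += 1
        sizes[key] += size
    return lang, num, sizes

def combine(results):
    """Merge per-file tallies into per-language dictionaries and averages."""
    position2num = defaultdict(Counter)
    position2sizes = defaultdict(Counter)
    for lang, num, sizes in results:
        position2num[lang].update(num)      # Counter.update adds counts
        position2sizes[lang].update(sizes)
    average_sizes = {
        lang: {key: position2sizes[lang][key] / count for key, count in num.items()}
        for lang, num in position2num.items()
    }
    return position2num, position2sizes, average_sizes

# Toy tasks standing in for files; two of them belong to the same language.
tasks = [
    ("en", [("right_1", 2), ("right_1", 4)]),
    ("en", [("right_1", 3), ("left_1", 1)]),
    ("fr", [("left_1", 2)]),
]

if __name__ == "__main__":
    with Pool() as pool:  # one worker process per available core by default
        results = pool.map(count_one_file, tasks)
    num, sizes, avg = combine(results)
    print(dict(avg))
```

Because each file is tallied independently and the merge is a simple sum, the work parallelizes cleanly across files, which matches the per-file progress bar in the output below.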

In [4]:
# Flatten all short files into a single list
allshortconll = []
for lang, files in langShortConllFiles.items():
    allshortconll.extend(files)

print(f"Processing {len(allshortconll)} files in parallel...")

Processing 814 files in parallel...


In [7]:
# Process files in parallel
all_langs_position2num, all_langs_position2sizes, all_langs_average_sizes = conll_processing.get_type_freq_all_files_parallel(allshortconll)
print("Processing complete!")
print(f"Results computed for {len(all_langs_position2num)} languages")

Starting processing all files, running on 80 cores


Processing files: 100%|██████████| 814/814 [00:34<00:00, 23.56it/s]



Finished processing. Combining results...
Done!
Processing complete!
Results computed for 186 languages


## 3. Filter by Minimum Count

Remove positions that occur fewer than `MIN_COUNT` times (10, as set in Configuration) to reduce noise from rarely observed positions.
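
A minimal sketch of such a filter is shown here, assuming both dictionaries share the same position keys per language. `filter_by_min_count_sketch` is a hypothetical name and may differ from what `analysis.filter_by_min_count` actually does.

```python
def filter_by_min_count_sketch(position2num, position2sizes, min_count=10):
    """Keep only positions whose occurrence count is at least min_count."""
    filtered_num, filtered_sizes = {}, {}
    for lang, counts in position2num.items():
        kept = [key for key, count in counts.items() if count >= min_count]
        filtered_num[lang] = {key: counts[key] for key in kept}
        # Sizes are filtered by the same keys so both dictionaries stay aligned.
        filtered_sizes[lang] = {key: position2sizes[lang][key] for key in kept}
    return filtered_num, filtered_sizes

# Toy data: 'right_3' occurs only 4 times and is dropped.
toy_num = {"en": {"right_1": 120, "right_2": 10, "right_3": 4}}
toy_sizes = {"en": {"right_1": 300, "right_2": 35, "right_3": 20}}

f_num, f_sizes = filter_by_min_count_sketch(toy_num, toy_sizes, min_count=10)
print(f_num)
print(f_sizes)
```

The threshold is inclusive, in line with the `>= MIN_COUNT` wording of the notebook's output.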

## 4. Compute Mean Aggregate Length (MAL)

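For each language, compute MALₙ for n = 1 up to the largest number of right dependents available after filtering. MALₙ pools positions 1 to n of verbs that have exactly n right dependents: the summed sizes divided by the summed counts.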

### MAL Formula

$$
\text{MAL}_n = \frac{\sum_{i=1}^{n} \text{position2sizes}[\text{right}\_i\text{\_totright}\_n]}{\sum_{i=1}^{n} \text{position2num}[\text{right}\_i\text{\_totright}\_n]}
$$
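
The formula can be sketched in code for a single language as follows. `compute_mal_sketch` is a hypothetical name, and stopping at the first n for which any `right_i_totright_n` key is missing is an assumption made for this illustration, not necessarily the behavior of `analysis.compute_MAL_per_language`.

```python
def compute_mal_sketch(position2sizes, position2num):
    """Compute {n: MAL_n} for one language from right_i_totright_n entries."""
    mal = {}
    n = 1
    while True:
        keys = [f"right_{i}_totright_{n}" for i in range(1, n + 1)]
        # Assumption: stop as soon as any position for this n is unavailable.
        if not all(k in position2num and k in position2sizes for k in keys):
            break
        total_size = sum(position2sizes[k] for k in keys)   # numerator
        total_count = sum(position2num[k] for k in keys)    # denominator
        mal[n] = total_size / total_count
        n += 1
    return mal

# Toy data for one language.
toy_num = {
    "right_1_totright_1": 10,
    "right_1_totright_2": 5,
    "right_2_totright_2": 5,
}
toy_sizes = {
    "right_1_totright_1": 20,
    "right_1_totright_2": 10,
    "right_2_totright_2": 20,
}
print(compute_mal_sketch(toy_sizes, toy_num))  # {1: 2.0, 2: 3.0}
```

Here MAL₁ = 20 / 10 = 2.0 and MAL₂ = (10 + 20) / (5 + 5) = 3.0.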

In [8]:
# Filter positions
filtered_position2num, filtered_position2sizes = analysis.filter_by_min_count(
    all_langs_position2num,
    all_langs_position2sizes,
    min_count=MIN_COUNT
)

# Count total positions before and after filtering
total_before = sum(len(positions) for positions in all_langs_position2num.values())
total_after = sum(len(positions) for positions in filtered_position2num.values())
print(f"Positions before filtering: {total_before}")
print(f"Positions after filtering (>= {MIN_COUNT}): {total_after}")
print(f"Removed: {total_before - total_after} ({100*(total_before - total_after)/total_before:.1f}%)")

Filtered to positions with at least 10 occurrences
Positions before filtering: 10759
Positions after filtering (>= 10): 6372
Removed: 4387 (40.8%)


## 5. Data Structure Documentation

### all_langs_position2num
Dictionary mapping each language code to a dictionary of position keys → occurrence counts.

**Structure**: `{lang_code: {position_key: count}}`

**Position keys**:
- `left_N`: N-th dependent to the left of verb (N=1 is closest)
- `right_N`: N-th dependent to the right of verb (N=1 is closest)
- `left_N_totleft_M`: N-th left dependent when there are M total left dependents
- `right_N_totright_M`: N-th right dependent when there are M total right dependents
- `average_totleft_M`: average size across all left dependents when there are M in total
- `average_totright_M`: average size across all right dependents when there are M in total

**Example**:
```python
all_langs_position2num['en'] = {
    'left_1': 12543,
    'right_1': 18732,
    'right_1_totright_2': 5234,
    ...
}
```
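
To make the key scheme concrete, this sketch generates the positional keys for one verb and tallies them over a few verbs. `position_keys` is a hypothetical helper, the `average_*` keys are left out, and the assumption that every dependent contributes to both the plain and the `tot` key is made for illustration only.

```python
from collections import Counter

def position_keys(n_left, n_right):
    """Return the positional keys for a verb with the given dependent counts."""
    keys = []
    for side, total in (("left", n_left), ("right", n_right)):
        for i in range(1, total + 1):  # i = 1 is the dependent closest to the verb
            keys.append(f"{side}_{i}")
            keys.append(f"{side}_{i}_tot{side}_{total}")
    return keys

print(position_keys(1, 2))

# Tally keys over three toy verbs given as (n_left, n_right).
counts = Counter()
for n_left, n_right in [(1, 2), (0, 2), (1, 0)]:
    counts.update(position_keys(n_left, n_right))
print(counts["right_1"], counts["right_1_totright_2"], counts["left_1"])
```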

### all_langs_position2sizes
Dictionary mapping each language code to a dictionary of position keys → total size (sum of all dependency sizes).

**Structure**: `{lang_code: {position_key: total_size}}`

**Example**:
```python
all_langs_position2sizes['en'] = {
    'left_1': 25086,  # Total size = 12543 occurrences * ~2 words average
    'right_1': 56196,  # Total size = 18732 occurrences * ~3 words average
    ...
}
```

### all_langs_average_sizes
Dictionary mapping each language code to a dictionary of position keys → average dependency size.

**Structure**: `{lang_code: {position_key: average_size}}`

**Computation**: `average_size = total_size / count`

**Example**:
```python
all_langs_average_sizes['en'] = {
    'left_1': 2.0,  # 25086 / 12543
    'right_1': 3.0,  # 56196 / 18732
    ...
}
```
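
The relation between the three dictionaries can be checked with a short sketch that reuses the illustrative numbers above. `average_sizes_sketch` is a hypothetical helper, not part of the notebook's modules.

```python
def average_sizes_sketch(position2sizes, position2num):
    """Divide total size by count for every language and position key."""
    return {
        lang: {
            key: position2sizes[lang][key] / count
            for key, count in counts.items()
            if count > 0  # guard against division by zero
        }
        for lang, counts in position2num.items()
    }

num = {"en": {"left_1": 12543, "right_1": 18732}}
sizes = {"en": {"left_1": 25086, "right_1": 56196}}
print(average_sizes_sketch(sizes, num))  # {'en': {'left_1': 2.0, 'right_1': 3.0}}
```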

### lang2MAL
Dictionary mapping each language code to a dictionary of n → MALₙ values.

**Structure**: `{lang_code: {n: MAL_n}}`

**Computation**: For each n, take the verbs with exactly n right dependents and divide the summed sizes of their dependents at positions 1 to n by the summed counts of those positions.

**Formula**:
$$
\text{MAL}_n = \frac{\sum_{i=1}^{n} \text{position2sizes}[\text{right}\_i\text{\_totright}\_n]}{\sum_{i=1}^{n} \text{position2num}[\text{right}\_i\text{\_totright}\_n]}
$$

**Example**:
```python
lang2MAL['en'] = {
    1: 2.5,  # Average size when 1 right dependent
    2: 3.2,  # Average aggregate size when 2 right dependents
    3: 3.8,  # Average aggregate size when 3 right dependents
    ...
}
```

**Interpretation**: MALₙ typically increases with n, indicating that having more dependents correlates with longer dependencies.
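
One way to check this trend for a given language is sketched below. `is_increasing` is a hypothetical helper, and the `en` values are the illustrative numbers from the example above, not computed results.

```python
def is_increasing(mal_by_n):
    """Return True if MAL_n strictly increases with n (vacuously True for one n)."""
    values = [mal_by_n[n] for n in sorted(mal_by_n)]
    return all(a < b for a, b in zip(values, values[1:]))

toy_lang2MAL = {
    "en": {1: 2.5, 2: 3.2, 3: 3.8},   # illustrative values
    "xx": {1: 3.0, 2: 2.7},           # made-up counterexample
}
trend = {lang: is_increasing(mal) for lang, mal in toy_lang2MAL.items()}
print(trend)  # {'en': True, 'xx': False}
```

Languages with a single MALₙ value pass the check trivially, so they are best excluded before drawing conclusions about the trend.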

In [9]:
# Compute MAL for each language
lang2MAL = analysis.compute_MAL_per_language(
    filtered_position2sizes,
    filtered_position2num
)

print(f"MAL computed for {len(lang2MAL)} languages")

# Show sample
sample_lang = list(lang2MAL.keys())[0]
print(f"\nSample (language '{sample_lang}'):")
print(lang2MAL[sample_lang])

MAL computed for 186 languages

Sample (language 'abq'):
{1: 1.8055555555555556}


## 6. Export Analysis Results

In [10]:
# Save all analysis results
analysis.save_analysis_results(
    all_langs_position2num,
    all_langs_position2sizes,
    all_langs_average_sizes,
    filtered_position2num,
    filtered_position2sizes,
    lang2MAL,
    output_dir=DATA_DIR
)

print(f"Analysis results saved to {DATA_DIR}/")

Saved all_langs_position2num.pkl
Saved all_langs_position2sizes.pkl
Saved all_langs_average_sizes.pkl
Saved filtered_position2num.pkl
Saved filtered_position2sizes.pkl
Saved lang2MAL.pkl
All analysis results saved to data/
Analysis results saved to data/
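
The export can be pictured as one pickle file per result object, named after the variable. This is a sketch under that assumption: `save_results_sketch` is a hypothetical name, and the real `analysis.save_analysis_results` may be implemented differently.

```python
import os
import pickle
import tempfile

def save_results_sketch(results, output_dir):
    """Write each named object to <output_dir>/<name>.pkl and return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for name, obj in results.items():
        path = os.path.join(output_dir, f"{name}.pkl")
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        paths.append(path)
    return paths

# Round trip with a toy result in a temporary directory.
toy_results = {"lang2MAL": {"en": {1: 2.5, 2: 3.2}}}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results_sketch(toy_results, tmp)
    saved_names = [os.path.basename(p) for p in saved]
    with open(saved[0], "rb") as f:
        reloaded = pickle.load(f)

print(saved_names)  # ['lang2MAL.pkl']
```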


## Summary

This notebook has:
- ✅ Loaded metadata from notebook 01
- ✅ Processed all CoNLL files in parallel (extracted dependency sizes)
- ✅ Filtered positions by minimum count (>= 10)
- ✅ Computed Mean Aggregate Length (MAL) for each language
- ✅ Exported 6 analysis result files to data/

**Next step**: Run `03_visualization.ipynb` to create plots and explore results.