## Imports

**Environment Setup**: Select the `ragenv` kernel from the kernel picker (top-right corner of the notebook). The kernel has been registered and should be available in the list.

In [1]:
import os
import pandas as pd
from glob import glob
from tqdm import tqdm

# Import custom modules
import data_utils
import conll_processing
import validation

## Configuration

In [2]:
# Paths and constants
UD_VERSION = "2.17"
UD_DIR = f"ud-treebanks-v{UD_VERSION}"
CREDENTIALS_FILE = "typometrics-c4750cac2e21.json"
SPREADSHEET_URL = "https://docs.google.com/spreadsheets/d/1IP3ebsNNVAsQ5sxmBnfEAmZc4f0iotAL9hd4aqOOcEg/edit"
DATA_DIR = "data"

## 1. Download and Extract UD Treebanks

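Check whether the UD treebanks are already present locally. The cell does not download anything itself: if the directory is missing, it prints the repository link and the target path so the archive can be fetched and extracted manually.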

In [3]:
# Check if already downloaded
if not os.path.exists(UD_DIR):
    print(f"UD treebanks v{UD_VERSION} not found. Please download manually:")
    print("https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-5772")
    print(f"Extract to: {os.path.abspath(UD_DIR)}")
else:
    print(f"UD treebanks v{UD_VERSION} found at: {os.path.abspath(UD_DIR)}")

UD treebanks v2.17 found at: /bigstorage/kim/typometrics/dataanalysis/ud-treebanks-v2.17
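
If the archive has been downloaded but not yet extracted, the extraction can also be done from Python. The following is only a sketch: `extract_treebanks` is a hypothetical helper that is not part of this notebook or its modules, and the archive file name in the usage comment is an assumption about how the release is packaged.

```python
import os
import tarfile


def extract_treebanks(archive_path: str, target_parent: str = ".") -> None:
    """Extract a (possibly compressed) tar archive into target_parent."""
    with tarfile.open(archive_path, "r:*") as archive:
        # Newer Python versions provide extraction filters that reject
        # unsafe members (absolute paths, links pointing outside the target).
        if hasattr(tarfile, "data_filter"):
            archive.extractall(target_parent, filter="data")
        else:
            archive.extractall(target_parent)


# Hypothetical usage; the archive name is an assumption:
# extract_treebanks("ud-treebanks-v2.17.tgz", ".")
# print(os.path.isdir("ud-treebanks-v2.17"))
```

Only extract archives obtained from the official repository, since tar archives from untrusted sources can contain unsafe paths.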


## 2. Load CoNLL Files

In [4]:
# Find all CoNLL-U files and group by language
# This matches the original getAllConllFilesGroup() function
langConllFiles = {}

# List all directories in the UD folder
doc_list = [d for d in sorted(os.listdir(UD_DIR)) if os.path.isdir(os.path.join(UD_DIR, d))]

for doc_name in doc_list:
    doc_path = os.path.join(UD_DIR, doc_name)
    # Find all .conllu files in this directory (excluding not-to-release)
    conll_files = [
        os.path.join(doc_path, f) 
        for f in os.listdir(doc_path) 
        if f.endswith(".conllu") and "not-to-release" not in doc_name
    ]
    
    if conll_files:
        # Extract language code from first filename (e.g., en_ewt-ud-train.conllu -> en)
        first_file = os.path.basename(conll_files[0])
        lang_code = first_file.split('_', 1)[0].lower()
        
        # Add all files for this language
        if lang_code not in langConllFiles:
            langConllFiles[lang_code] = []
        langConllFiles[lang_code].extend(conll_files)

print(f"Languages represented: {len(langConllFiles)}")
print(f"Sample languages: {list(langConllFiles.keys())[:10]}")

Languages represented: 186
Sample languages: ['abq', 'ab', 'af', 'akk', 'aqz', 'sq', 'gsw', 'am', 'grc', 'hbo']
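
The grouping above relies on the UD file naming convention, where the part of the file name before the first underscore is the language code. A minimal sketch of that step in isolation follows. The helper names are hypothetical and the paths are illustrative. Unlike the cell above, which assigns a whole directory to the code of its first file, this sketch derives the code for each file separately.

```python
import os


def lang_code_from_filename(path: str) -> str:
    """Return the language prefix of a UD file name, e.g. 'en' for 'en_ewt-ud-train.conllu'."""
    filename = os.path.basename(path)
    return filename.split("_", 1)[0].lower()


def group_by_language(paths):
    """Group file paths by the language code derived from each file name."""
    groups = {}
    for path in paths:
        groups.setdefault(lang_code_from_filename(path), []).append(path)
    return groups


# Illustrative paths, not read from disk
example = group_by_language([
    "UD_English-EWT/en_ewt-ud-train.conllu",
    "UD_English-GUM/en_gum-ud-dev.conllu",
    "UD_Abaza-ATB/abq_atb-ud-test.conllu",
])
print(sorted(example))  # ['abq', 'en']
```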


## 3. Load Google Sheets Metadata

In [5]:
# Load Google Sheets data
sheets_data = data_utils.load_google_sheets(CREDENTIALS_FILE, SPREADSHEET_URL)
print("Loaded sheets:", list(sheets_data['sheets'].keys()))

Loaded sheets: ['my_language', 'language_to_group', 'appearance', 'all_languages_code']


In [6]:
# Display sheet previews
for sheet_name, df in sheets_data['dataframes'].items():
    print(f"\n{sheet_name}:")
    print(df.head())


my_language:
  code        displayName
0   ca            Catalan
1   cu  OldChurchSlavonic
2   nl              Dutch
3   el              Greek
4   ht            Haitian

language_to_group:
    Language          Group     Genus Column 1    Simple Group Area
0      Abaza      Caucasian                          Caucasian    E
1     Abkhaz      Caucasian                          Caucasian    E
2  Afrikaans  Indo-European  Germanic            Indo-European   Af
3   Akkadian        Semitic                        Afroasiatic   ME
4    Akuntsu         Tupian                     South-American   SA

appearance:
           Group Default Color
0         Italic         brown
1    Baltoslavic        purple
2       Germanic         olive
3  Indo-European     royalBlue
4   Austronesian     limeGreen

all_languages_code:
  code   language
0   ab  Abkhazian
1   aa       Afar
2   af  Afrikaans
3   ak       Akan
4   sq   Albanian


## 4. Create Language Mappings

In [7]:
# Create language mappings
# Note: langNames comes from all_languages_code sheet (all ISO codes),
# then overridden by my_language sheet (custom display names for languages with spaces)
mappings = data_utils.create_language_mappings(sheets_data)

# Filter to only languages in our treebanks
all_langNames = mappings['langNames']
langNames = {lang: all_langNames[lang] for lang in langConllFiles if lang in all_langNames}

langnameGroup = mappings['langnameGroup']
group2lang = mappings['group2lang']
appearance_dict = mappings['appearance_dict']

print(f"Total language codes available: {len(all_langNames)}")
print(f"Languages in our treebanks: {len(langConllFiles)}")
print(f"Languages with names: {len(langNames)}")
print(f"Total language groups: {len(set(langnameGroup.values()))}")
print(f"Groups with colors: {len(appearance_dict)}")

Total language codes available: 8041
Languages in our treebanks: 186
Languages with names: 186
Total language groups: 11
Groups with colors: 29
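
The override logic described in the cell comment (names from `all_languages_code`, replaced by `my_language` display names where present) can be sketched with plain dictionaries. This is not the code of `data_utils.create_language_mappings`, whose implementation is not shown here. `build_lang_names` is a hypothetical helper, and the rows are small illustrative samples using the column names from the sheet previews.

```python
def build_lang_names(all_codes_rows, override_rows):
    """Map code -> name, letting custom display names win over default names."""
    names = {row["code"]: row["language"] for row in all_codes_rows}
    # Later updates overwrite earlier entries, so overrides take precedence
    names.update({row["code"]: row["displayName"] for row in override_rows})
    return names


# Illustrative sample rows (the real sheets are much larger)
all_codes_rows = [
    {"code": "cu", "language": "Church Slavic"},
    {"code": "nl", "language": "Dutch"},
    {"code": "af", "language": "Afrikaans"},
]
override_rows = [{"code": "cu", "displayName": "OldChurchSlavonic"}]

lang_names = build_lang_names(all_codes_rows, override_rows)

# Keep only the codes that actually occur in the treebanks
treebank_codes = ["cu", "af"]
filtered = {code: lang_names[code] for code in treebank_codes if code in lang_names}
print(filtered)  # {'cu': 'OldChurchSlavonic', 'af': 'Afrikaans'}
```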


## 5. Validate Language Codes

Check which languages in our treebanks:
- Have names with spaces (need custom display names in my_language sheet)
- Are missing from the language code mapping

In [8]:
# Validate language codes
my_language_sheet = sheets_data['sheets']['my_language']
validation.validate_language_codes(langConllFiles, langNames, my_language_sheet)
print("Language code validation complete. Check Google Sheet column E for results.")

Language with space:
 
Language to add: []
Language code validation complete. Check Google Sheet column E for results.
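
The two checks listed above can be sketched without the Google Sheets side. This is an assumption about what the checks amount to, not the implementation of `validation.validate_language_codes`. `check_language_codes` is a hypothetical helper and `xx` is a made-up code used only to trigger the "missing" case.

```python
def check_language_codes(treebank_codes, lang_names):
    """Return (codes whose name contains a space, codes with no name at all)."""
    with_space = sorted(
        code for code in treebank_codes
        if code in lang_names and " " in lang_names[code]
    )
    missing = sorted(code for code in treebank_codes if code not in lang_names)
    return with_space, missing


with_space, missing = check_language_codes(
    ["grc", "af", "xx"],
    {"grc": "Ancient Greek", "af": "Afrikaans"},
)
print(with_space)  # ['grc']
print(missing)     # ['xx']
```

In the run above both lists are empty, which is why the output shows no languages with spaces and `Language to add: []`.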


## 6. Validate Language Groups

Check which languages in our treebanks are missing group assignments.

In [9]:
# Validate language groups
language_to_group_sheet = sheets_data['sheets']['language_to_group']
validation.validate_language_groups(langConllFiles, langNames, langnameGroup, language_to_group_sheet)
print("Language group validation complete. Check Google Sheet column H for results.")

0 language groups to add:
 
['OK', '', '', '', '', '', '', '', '', '', '']
Language group validation complete. Check Google Sheet column H for results.
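
Finding languages without a group is a two-step lookup: code to language name, then name to group. The sketch below assumes that `langnameGroup` is keyed by language name, as its name and the `language_to_group` preview suggest. `languages_missing_group` is a hypothetical helper, not the function in `validation`.

```python
def languages_missing_group(treebank_codes, lang_names, name_to_group):
    """Return names (or bare codes) of treebank languages with no group assigned."""
    missing = []
    for code in treebank_codes:
        name = lang_names.get(code)
        # A language is flagged if it has no name or its name has no (non-empty) group
        if name is None or not name_to_group.get(name):
            missing.append(name or code)
    return sorted(missing)


# Illustrative data: Akuntsu deliberately has no group entry
sample_names = {"abq": "Abaza", "af": "Afrikaans", "aqz": "Akuntsu"}
sample_groups = {"Abaza": "Caucasian", "Afrikaans": "Indo-European"}

print(languages_missing_group(["abq", "af", "aqz"], sample_names, sample_groups))
# ['Akuntsu']
```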


## 7. Compute Basic Statistics

In [10]:
# Compute basic statistics for each language
stats_df = validation.compute_basic_statistics(langConllFiles, langNames, langnameGroup)
print(stats_df.head(20))

   language   languageName           group  nConllFiles  nSentences  nTokens  \
0       abq          Abaza       Caucasian            1          98     1240   
1        ab         Abkhaz       Caucasian            1        1316    14533   
2        af      Afrikaans   Indo-European            3        1936    55062   
3       akk       Akkadian     Afroasiatic            2        1976    31588   
4       aqz        Akuntsu  South-American            1         343     2762   
5        sq       Albanian   Indo-European            4         263     5270   
6       gsw    SwissGerman   Indo-European            2        1078    26213   
7        am        Amharic     Afroasiatic            1        1074    15932   
8       grc   AncientGreek   Indo-European            9       32585   540353   
9       hbo  AncientHebrew     Afroasiatic            3        5610   208302   
10      apu        Apurinã  South-American            1         165     1990   
11       ar         Arabic     Afroasiat

In [11]:
# Summary statistics
print("\nSummary:")
print(stats_df.describe())


Summary:
       nConllFiles     nSentences       nTokens  avgSentenceLength
count   186.000000     186.000000  1.860000e+02         186.000000
mean      3.688172   12441.016129  2.475959e+05          17.354590
std       4.530923   33591.599656  6.760125e+05           7.100783
min       1.000000       8.000000  8.700000e+01           6.220447
25%       1.000000     248.750000  3.422000e+03          11.972462
50%       2.000000    1221.000000  1.828150e+04          15.634257
75%       4.000000    5938.000000  1.449330e+05          21.836621
max      29.000000  253797.000000  5.285270e+06          44.322515
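
For orientation, the sentence and token counts can be reproduced on raw CoNLL-U text with a few lines. This is a sketch, not the code of `validation.compute_basic_statistics`. It assumes that tokens are the lines with a plain integer ID (so multiword ranges such as `1-2` and empty nodes such as `1.1` are skipped), that punctuation counts as a token, and that `avgSentenceLength` is `nTokens / nSentences`. The notebook's own counting rules may differ.

```python
def count_sentences_and_tokens(conllu_text: str):
    """Count sentences and integer-ID token lines in CoNLL-U text."""
    n_sentences = 0
    n_tokens = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..."
            continue
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            n_tokens += 1
            if not in_sentence:
                n_sentences += 1
                in_sentence = True
    return n_sentences, n_tokens


sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Stop!\n"
    "1\tStop\tstop\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)

n_sentences, n_tokens = count_sentences_and_tokens(sample)
avg = n_tokens / n_sentences
print(n_sentences, n_tokens, avg)  # 2 5 2.5
```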


## 8. Create Short CoNLL Files

Split large CoNLL-U files into chunks of 10,000 sentences so that they can be processed in parallel.
This step takes about 33 seconds on Calcul.

In [None]:
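# Create short files if not already present
# NOTE: step 9 reports the short files under `2.17_short`; if that is where
# make_shorter_conll_files writes, this check should test that path instead.
short_dir = f"{UD_DIR}_short"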
if not os.path.exists(short_dir):
    print("Creating short CoNLL files (this may take a while)...")
    conll_processing.make_shorter_conll_files(langConllFiles, UD_VERSION)
    print("Short files created.")
else:
    print(f"Short files already exist in {short_dir}")
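
The splitting idea itself is simple, because CoNLL-U separates sentences with a blank line. The sketch below shows it on an in-memory string. It is not the implementation of `conll_processing.make_shorter_conll_files`, `split_into_chunks` is a hypothetical helper, and it assumes LF line endings and truly empty separator lines.

```python
def split_into_chunks(conllu_text: str, chunk_size: int = 10_000):
    """Yield CoNLL-U strings that each hold at most chunk_size sentences."""
    sentences = [block for block in conllu_text.strip().split("\n\n") if block.strip()]
    for start in range(0, len(sentences), chunk_size):
        # Each chunk ends with a blank line, as CoNLL-U requires after every sentence
        yield "\n\n".join(sentences[start:start + chunk_size]) + "\n\n"


# Five identical one-token sentences, split into chunks of two
sentence = "1\tword\t_\t_\t_\t_\t0\troot\t_\t_"
text = "\n\n".join([sentence] * 5) + "\n\n"

chunks = list(split_into_chunks(text, chunk_size=2))
chunk_sizes = [len(chunk.strip().split("\n\n")) for chunk in chunks]
print(chunk_sizes)  # [2, 2, 1]
```

Writing each chunk to its own file gives independent units of work that can be handed to separate worker processes.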

## 9. Read Short CoNLL Files

In [12]:
# Read short files
print("Reading short CoNLL files...")
langShortConllFiles, allshortconll = conll_processing.read_shorter_conll_files(langConllFiles, UD_VERSION)

total_short_files = sum(len(files) for files in langShortConllFiles.values())
print(f"Total short files: {total_short_files}")
print(f"Languages with short files: {len(langShortConllFiles)}")

Reading short CoNLL files...
Found 814 short CoNLL files in 2.17_short
Total short files: 814
Languages with short files: 186


## 10. Export Metadata

Save all metadata and file lists for use in subsequent notebooks.

In [13]:
# Save metadata
metadata = {
    'langConllFiles': langConllFiles,
    'langShortConllFiles': langShortConllFiles,
    'langNames': langNames,
    'langnameGroup': langnameGroup,
    'group2lang': group2lang,
    'appearance_dict': appearance_dict,
    'ud_version': UD_VERSION
}

data_utils.save_metadata(metadata, os.path.join(DATA_DIR, 'metadata.pkl'))
print(f"Metadata saved to {DATA_DIR}/metadata.pkl")

Saved metadata to data/metadata.pkl
Metadata saved to data/metadata.pkl
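
A downstream notebook needs to read this file back. Given the `.pkl` extension, the sketch below assumes the metadata is stored with Python's `pickle` module. The internals of `data_utils.save_metadata` are not shown here, and both helper names below are hypothetical.

```python
import os
import pickle
import tempfile


def save_metadata_sketch(metadata: dict, path: str) -> None:
    """Pickle a metadata dictionary, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(metadata, handle)


def load_metadata_sketch(path: str) -> dict:
    """Load a metadata dictionary written by save_metadata_sketch."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory so nothing in data/ is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "data", "metadata.pkl")
    save_metadata_sketch({"ud_version": "2.17", "langNames": {"af": "Afrikaans"}}, demo_path)
    restored = load_metadata_sketch(demo_path)

print(restored["ud_version"])  # 2.17
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.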


## Summary

This notebook has:
- ✅ Loaded UD treebanks v2.17
- ✅ Connected to Google Sheets for language metadata
- ✅ Validated language codes and groups
- ✅ Computed basic statistics (files, sentences, tokens)
- ✅ Created short CoNLL files for parallel processing
- ✅ Exported metadata for downstream notebooks

**Next step**: Run `02_dependency_analysis.ipynb` to compute dependency size metrics.