# Plan to Medal: Text Normalization Challenge (Russian)

Objectives:
- Build a strong, fast baseline with reliable CV; iterate to reach ≥ bronze.

High-level Strategy:
1) Data understanding:
   - Inspect train/test schemas, sizes, target definition, and sample submission.
   - Verify label distribution and common semiotic classes (e.g., PLAIN, PUNCT, DATE, CARDINAL, etc.).
2) CV protocol:
   - Mirror test distribution by grouping/stratifying on semiotic class if available.
   - Deterministic KFold with fixed seed; save folds to disk.
3) Baseline modeling:
   - Token-level classifier to predict the normalized form from the written token.
   - Start with class-conditional rules + ML fallback:
     a) Heuristics/rules for PLAIN/PUNCT/LETTERS-to-lowercase passthrough.
     b) ML model(s) for challenging classes (NUMBER/CARDINAL/ORDINAL/DATE/TIME/MONEY/MEASURE).
   - Baseline ML: CatBoost/XGBoost with character n-gram features and simple regex-derived features.
4) Feature engineering:
   - Regex flags (has_digits, has_letters, cyrillic_only, latin_only, contains_currency, contains_colon, etc.).
   - Character n-gram TF-IDF blocks; cache matrices.
   - Context features: previous/next token/symbols.
5) Inference pipeline:
   - Semiotic class detection (model or rules) → route to per-class normalizer (rule or model).
   - Fallback to identity where appropriate.
6) Iteration & validation discipline:
   - Log OOF accuracy by class; mine top error buckets and add rules.
   - Save OOF/test predictions and per-class confusion matrices.
7) Ensembling:
   - Blend per-class specialized models if beneficial.

Immediate Next Steps:
- Verify GPU, list files, load train/test heads, parse schema, inspect sample submission.
- Identify target column(s) and class labels; count semiotic classes.
- Establish a simple deterministic CV and a minimal baseline (identity/rule-based) for a quick score.

Questions for Experts (next cell will request review):
- What medal-winning architectures/pipelines won here? Rule-based cascades + ML? Best per-class strategies?
- Key regex/rule patterns that cover most errors? Common pitfalls/leakage traps in this dataset?

In [1]:
# Environment and data inspection
import os, sys, zipfile, json, re, time, textwrap, subprocess, io
import pandas as pd
from pathlib import Path

print('CWD:', os.getcwd())
print('Files:', os.listdir())

# Quick CUDA/NVIDIA presence check without heavy installs
print('\nCUDA paths present:', os.path.exists('/usr/local/cuda'), os.path.exists('/usr/local/nvidia'))
try:
    out = subprocess.run(['bash','-lc','nvcc --version'], capture_output=True, text=True, timeout=5)
    print('nvcc:', out.stdout.splitlines()[-1] if out.returncode==0 else out.stderr.strip())
except Exception as e:
    print('nvcc check skipped:', e)

def zip_head(path, n=5):
    df = pd.read_csv(path, compression='zip', nrows=n)
    print(f'Head of {path}:')
    print(df.head())
    return df

def zip_columns(path):
    df = pd.read_csv(path, compression='zip', nrows=1)
    return list(df.columns)

def zip_row_count(path, chunksize=1_000_000):
    cnt = 0
    for chunk in pd.read_csv(path, compression='zip', chunksize=chunksize):
        cnt += len(chunk)
    return cnt

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'
sub_path = 'ru_sample_submission_2.csv.zip'

print('\nColumns:')
for p in [train_path, test_path, sub_path]:
    try:
        print(p, '->', zip_columns(p))
    except Exception as e:
        print(p, '-> error:', e)

print('\nPreview:')
train_head = zip_head(train_path, 5)
test_head = zip_head(test_path, 5)
sub_head = zip_head(sub_path, 5)

print('\nCounting rows (this may take a moment, logs per file):')
t0=time.time()
for p in [train_path, test_path, sub_path]:
    t1=time.time()
    n = zip_row_count(p, chunksize=500_000)
    print(f'{p}: rows={n:,} (elapsed {time.time()-t1:.1f}s)')
print('Total elapsed:', time.time()-t0, 's')

# Heuristic guess of key columns
train_cols = list(train_head.columns)
print('\nGuessed columns in train:', train_cols)
likely_id = [c for c in train_cols if c.lower() in ('id','index') or c.endswith('_id')]
likely_before = [c for c in train_cols if c.lower()=='before']
likely_after = [c for c in train_cols if c.lower()=='after']
likely_class = [c for c in train_cols if c.lower() in ('class','semiotic_class')]
likely_sent = [c for c in train_cols if 'sent' in c.lower()]
print('id:', likely_id, 'before:', likely_before, 'after:', likely_after, 'class:', likely_class, 'sentence:', likely_sent)

# Show basic value_counts if columns exist
if likely_class:
    vc = pd.read_csv(train_path, compression='zip', usecols=[likely_class[0]], nrows=1_000_000)[likely_class[0]].value_counts().head(20)
    print('\nClass distribution (first 1M rows):')
    print(vc)

# Sample identity-rate estimate on small sample
if likely_before and likely_after:
    sample = pd.read_csv(train_path, compression='zip', nrows=200_000, usecols=[likely_before[0], likely_after[0]])
    ident_rate = (sample[likely_before[0]] == sample[likely_after[0]]).mean()
    print(f'\nApprox identity rate on 200k sample: {ident_rate:.4f}')

print('\nDone.')

CWD: /app/agent_run_states/text-normalization-challenge-russian-language-spray-20250911-234104
Files: ['main.ipynb', 'agent_metadata', 'ru_sample_submission_2.csv.zip', 'requirements.txt', 'task.txt', 'vm_info.json', 'ru_test_2.csv.zip', 'docker_run.log', 'description.md', 'ru_train.csv.zip']

CUDA paths present: True True
nvcc: bash: line 1: nvcc: command not found

Columns:
ru_train.csv.zip -> ['sentence_id', 'token_id', 'class', 'before', 'after']
ru_test_2.csv.zip -> ['sentence_id', 'token_id', 'before']
ru_sample_submission_2.csv.zip -> ['id', 'after']

Preview:
Head of ru_train.csv.zip:
   sentence_id  token_id  class     before  \
0            0         0  PLAIN         По   
1            0         1  PLAIN  состоянию   
2            0         2  PLAIN         на   
3            0         3   DATE   1862 год   
4            0         4  PUNCT          .   

                                    after  
0                                      По  
1                               сос

ru_train.csv.zip: rows=9,515,325 (elapsed 9.5s)


ru_test_2.csv.zip: rows=1,059,191 (elapsed 0.5s)


ru_sample_submission_2.csv.zip: rows=1,059,191 (elapsed 0.8s)
Total elapsed: 10.918184041976929 s

Guessed columns in train: ['sentence_id', 'token_id', 'class', 'before', 'after']
id: ['sentence_id', 'token_id'] before: ['before'] after: ['after'] class: ['class'] sentence: ['sentence_id']



Class distribution (first 1M rows):
class
PLAIN         694962
PUNCT         217498
CARDINAL       25691
LETTERS        18609
DATE           17593
VERBATIM       14572
ORDINAL         4311
MEASURE         3818
TELEPHONE        966
DECIMAL          657
ELECTRONIC       549
MONEY            219
FRACTION         214
DIGIT            189
TIME             152
Name: count, dtype: int64



Approx identity rate on 200k sample: 0.8787

Done.


In [2]:
# Baseline submission: memorize (before -> most frequent after) and fallback to identity
import pandas as pd
import time
from collections import defaultdict, Counter

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

t0 = time.time()
print('Building memorization map from train (before -> most frequent after) ...')
counts = defaultdict(Counter)
rows = 0
chunk_idx = 0
for chunk in pd.read_csv(train_path, compression='zip', usecols=['before','after'], chunksize=500_000):
    chunk_idx += 1
    for b, a in zip(chunk['before'].values, chunk['after'].values):
        counts[b][a] += 1
    rows += len(chunk)
    if chunk_idx % 5 == 0:
        print(f'  processed {rows:,} rows (chunks {chunk_idx}), unique befores: {len(counts):,}')

memorized = {}
for b, ctr in counts.items():
    memorized[b] = ctr.most_common(1)[0][0]
print(f'Map size: {len(memorized):,}. Build time: {time.time()-t0:.1f}s')

print('Loading test and generating predictions ...')
test = pd.read_csv(test_path, compression='zip')
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
t1 = time.time()
pred = test['before'].map(memorized)
missing = pred.isna().sum()
print(f'Mapped: {len(pred)-missing:,}, missing: {missing:,}')
pred = pred.fillna(test['before'])

submission = pd.DataFrame({'id': test['id'], 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Total elapsed:', f'{time.time()-t0:.1f}s')

# Preview
print(submission.head())

Building memorization map from train (before -> most frequent after) ...


  processed 2,500,000 rows (chunks 5), unique befores: 324,298


  processed 5,000,000 rows (chunks 10), unique befores: 504,028


  processed 7,500,000 rows (chunks 15), unique befores: 649,053


  processed 9,515,325 rows (chunks 20), unique befores: 751,570


Map size: 751,570. Build time: 18.6s
Loading test and generating predictions ...


Mapped: 1,007,917, missing: 51,274


Saved submission.csv with 1059191 rows. Total elapsed: 21.5s
    id        after
0  0_0       Теперь
1  0_1          все
2  0_2  уважительно
3  0_3        зовут
4  0_4      Ямамото


In [4]:
# OOF CV for memorization baseline with <self> fix and unicode normalization
import pandas as pd, numpy as np, time, unicodedata
from sklearn.model_selection import GroupKFold

def normalize_before(x: str) -> str:
    if not isinstance(x, str):
        return x
    x = unicodedata.normalize('NFKC', x)
    x = x.replace('\u00A0', ' ').replace('\u2009', ' ')
    return x

t0 = time.time()
print('Loading full train for OOF...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
print('Train shape:', train.shape)

print('Normalizing before...')
train['before_norm'] = train['before'].map(normalize_before)

groups = train['sentence_id'].values
gkf = GroupKFold(n_splits=5)

oof_pred = pd.Series(index=train.index, dtype=object)
fold_acc = []
t_start = time.time()
for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    t_fold = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after']]
    # Build mapping: for each before_norm, pick most frequent after
    cnt = tr.value_counts(['before_norm','after']).reset_index(name='cnt')
    # For speed, sort and drop_duplicates to keep top by cnt
    cnt = cnt.sort_values(['before_norm','cnt'], ascending=[True, False])
    top = cnt.drop_duplicates('before_norm').set_index('before_norm')['after']
    va_before_norm = train.loc[va_idx, 'before_norm']
    mapped = va_before_norm.map(top)
    # <self> fix: when mapped is literal '<self>', replace by original before
    before_orig = train.loc[va_idx, 'before']
    mapped = np.where(mapped == '<self>', before_orig.values, mapped)
    # Fallback: if unmapped (NaN), use original before
    mapped = pd.Series(mapped, index=va_idx)
    na_mask = mapped.isna()
    if na_mask.any():
        mapped.loc[na_mask] = before_orig.loc[na_mask]
    oof_pred.loc[va_idx] = mapped
    # Accuracy
    y_true = train.loc[va_idx, 'after']
    acc = (mapped.values == y_true.values).mean()
    fold_acc.append(acc)
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {(time.time()-t_fold):.1f}s')
    
overall_acc = (oof_pred.values == train['after'].values).mean()
print(f'OOF overall accuracy: {overall_acc:.6f}')
print('Per-class OOF (top 15 classes):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(15))
print('OOF done. Total elapsed:', f'{time.time()-t0:.1f}s')

# Optional: rebuild full mapping on all train for improved final baseline, with normalization and <self> fix
print('\nRebuilding full mapping with normalization for final baseline (no save here) ...')
cnt_all = train.value_counts(['before_norm','after']).reset_index(name='cnt')
cnt_all = cnt_all.sort_values(['before_norm','cnt'], ascending=[True, False])
full_map = cnt_all.drop_duplicates('before_norm').set_index('before_norm')['after']
print('Full map size:', len(full_map))
print('Example lookups:')
for ex in ['1', '12', '1999', '2000', 'г.']:
    key = normalize_before(ex)
    val = full_map.get(key, None)
    print(ex, '->', val)

Loading full train for OOF...


Train shape: (9515325, 5)
Normalizing before...


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.972497 | elapsed 11.7s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.972404 | elapsed 11.6s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.972433 | elapsed 11.2s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.972655 | elapsed 11.0s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.972408 | elapsed 11.3s
OOF overall accuracy: 0.972479
Per-class OOF (top 15 classes):


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


class
PUNCT         1.000000
VERBATIM      0.992550
PLAIN         0.984675
LETTERS       0.939792
DIGIT         0.900336
DATE          0.773572
CARDINAL      0.768191
TIME          0.700396
ORDINAL       0.550440
MEASURE       0.534407
FRACTION      0.517162
ELECTRONIC    0.418627
DECIMAL       0.360833
TELEPHONE     0.197771
MONEY         0.148087
dtype: float64
OOF done. Total elapsed: 78.2s

Rebuilding full mapping with normalization for final baseline (no save here) ...


Full map size: 751508
Example lookups:
1 -> один
12 -> двенадцать
1999 -> тысяча девятьсот девяносто девять
2000 -> две тысячи
г. -> None


In [7]:
# Rebuild submission with unicode-normalized mapping and <self> fix
import pandas as pd, time, unicodedata
from collections import defaultdict, Counter

def normalize_before(x: str) -> str:
    if not isinstance(x, str):
        return x
    x = unicodedata.normalize('NFKC', x)
    x = x.replace('\u00A0', ' ').replace('\u2009', ' ')
    return x

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

t0 = time.time()
print('Building normalized memorization map with <self> fix ...')
counts = defaultdict(Counter)
rows = 0
chunk_idx = 0
for chunk in pd.read_csv(train_path, compression='zip', usecols=['before','after'], chunksize=500_000):
    chunk_idx += 1
    bnorm = chunk['before'].map(normalize_before).values
    after = chunk['after'].values
    for b, a in zip(bnorm, after):
        counts[b][a] += 1
    rows += len(chunk)
    if chunk_idx % 5 == 0:
        print(f'  processed {rows:,} rows (chunks {chunk_idx}), unique normalized befores: {len(counts):,}')

memorized_norm = {}
for b, ctr in counts.items():
    memorized_norm[b] = ctr.most_common(1)[0][0]
print(f'Normalized map size: {len(memorized_norm):,}. Build time: {time.time()-t0:.1f}s')

print('Loading test and generating predictions with <self> fix ...')
test = pd.read_csv(test_path, compression='zip')
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
bn = test['before'].map(normalize_before)
pred = bn.map(memorized_norm)
missing = pred.isna().sum()
print(f'Mapped: {len(pred)-missing:,}, missing: {missing:,}')
# <self> -> original token
pred = pred.mask(pred == '<self>', test['before'])
# Fallback to identity
pred = pred.fillna(test['before'])

submission = pd.DataFrame({'id': test['id'], 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Total elapsed:', f'{time.time()-t0:.1f}s')
print(submission.head())

Building normalized memorization map with <self> fix ...


  processed 2,500,000 rows (chunks 5), unique normalized befores: 324,280


  processed 5,000,000 rows (chunks 10), unique normalized befores: 503,997


  processed 7,500,000 rows (chunks 15), unique normalized befores: 648,998


  processed 9,515,325 rows (chunks 20), unique normalized befores: 751,509


Normalized map size: 751,509. Build time: 22.8s
Loading test and generating predictions with <self> fix ...


Mapped: 1,007,927, missing: 51,264


Saved submission.csv with 1059191 rows. Total elapsed: 26.4s
    id        after
0  0_0       Теперь
1  0_1          все
2  0_2  уважительно
3  0_3        зовут
4  0_4      Ямамото


In [5]:
# Improved generator: regex router + CARDINAL/DECIMAL using num2words(ru); fallback to normalized memorization
import re, time, unicodedata, subprocess, sys
import pandas as pd
from collections import defaultdict, Counter

# Install num2words if missing
try:
    import num2words
except ImportError:
    print('Installing num2words...')
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'num2words'], check=True)
    import num2words

from num2words import num2words as n2w

def normalize_before(x: str) -> str:
    if not isinstance(x, str):
        return x
    x = unicodedata.normalize('NFKC', x)
    x = x.replace('\u00A0', ' ').replace('\u2009', ' ')
    return x

digit_words = ['ноль','один','два','три','четыре','пять','шесть','семь','восемь','девять']

int_re = re.compile(r'^-?\d{1,3}([ \u00A0]?\d{3})*$|^-?\d+$')
dec_re = re.compile(r'^-?\d+[\.,]\d+$')

def to_int_cardinal_ru(s: str) -> str | None:
    try:
        neg = s.strip().startswith('-')
        cleaned = s.replace(' ', '').replace('\u00A0','').lstrip('+')
        cleaned = cleaned[1:] if cleaned.startswith('-') else cleaned
        num = int(cleaned)
        spoken = n2w(num, lang='ru')
        if neg:
            spoken = 'минус ' + spoken
        return spoken
    except Exception:
        return None

def to_decimal_ru(s: str) -> str | None:
    try:
        neg = s.strip().startswith('-')
        t = s.strip().lstrip('+')
        t = t[1:] if t.startswith('-') else t
        if ',' in t:
            ip, fp = t.split(',', 1)
        else:
            ip, fp = t.split('.', 1)
        ip_spoken = to_int_cardinal_ru(ip) if ip else 'ноль'
        if ip_spoken is None:
            return None
        fp_digits = [digit_words[int(ch)] if ch.isdigit() else ch for ch in fp]
        frac_spoken = ' '.join(fp_digits)
        out = f"{ip_spoken} запятая {frac_spoken}"
        if neg:
            out = 'минус ' + out if not out.startswith('минус') else out
        return out
    except Exception:
        return None

def build_normalized_memorizer(train_path: str, chunksize: int = 500_000):
    counts = defaultdict(Counter)
    rows = 0
    chunk_idx = 0
    t0 = time.time()
    for chunk in pd.read_csv(train_path, compression='zip', usecols=['before','after'], chunksize=chunksize):
        chunk_idx += 1
        bnorm = chunk['before'].map(normalize_before).values
        after = chunk['after'].values
        for b, a in zip(bnorm, after):
            counts[b][a] += 1
        rows += len(chunk)
        if chunk_idx % 5 == 0:
            print(f'  memorizer processed {rows:,} rows (chunks {chunk_idx}), unique keys: {len(counts):,}')
    mem = {b: ctr.most_common(1)[0][0] for b, ctr in counts.items()}
    print(f'  memorizer size: {len(mem):,}. build {time.time()-t0:.1f}s')
    return mem

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

print('Building normalized memorizer (for fallback) ...')
memorized_norm = build_normalized_memorizer(train_path)

print('Loading test ...')
test = pd.read_csv(test_path, compression='zip')
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
bn = test['before'].map(normalize_before)

pred = []
t0 = time.time()
for i, (b_raw, b) in enumerate(zip(test['before'].values, bn.values)):
    out = None
    if isinstance(b, str):
        if int_re.match(b):
            out = to_int_cardinal_ru(b)
        elif dec_re.match(b):
            out = to_decimal_ru(b)
    # Fallback to memorized normalized map
    if out is None:
        out = memorized_norm.get(b, None)
        if out == '<self>' or out is None:
            out = b_raw
    pred.append(out)
    if (i+1) % 200000 == 0:
        print(f'  processed {i+1:,}/{len(test)} tokens...')

submission = pd.DataFrame({'id': test['id'], 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Elapsed:', f'{time.time()-t0:.1f}s')
print(submission.head())

Installing num2words...




Building normalized memorizer (for fallback) ...


  memorizer processed 2,500,000 rows (chunks 5), unique keys: 324,280


  memorizer processed 5,000,000 rows (chunks 10), unique keys: 503,997


  memorizer processed 7,500,000 rows (chunks 15), unique keys: 648,998


  memorizer processed 9,515,325 rows (chunks 20), unique keys: 751,509


  memorizer size: 751,509. build 20.4s
Loading test ...


  processed 200,000/1059191 tokens...


  processed 400,000/1059191 tokens...


  processed 600,000/1059191 tokens...


  processed 800,000/1059191 tokens...


  processed 1,000,000/1059191 tokens...


Saved submission.csv with 1059191 rows. Elapsed: 2.9s
    id        after
0  0_0       Теперь
1  0_1          все
2  0_2  уважительно
3  0_3        зовут
4  0_4      Ямамото


In [6]:
# OOF CV for enhanced pipeline: integer/decimal generators + fold-built memorizer
import pandas as pd, numpy as np, time, unicodedata
from sklearn.model_selection import GroupKFold
from collections import defaultdict, Counter

# Reuse helpers from previous cells: normalize_before, int_re, dec_re, to_int_cardinal_ru, to_decimal_ru

def build_memorizer_from_df(df: pd.DataFrame) -> dict:
    cnt = df.value_counts(['before_norm','after']).reset_index(name='cnt')
    cnt = cnt.sort_values(['before_norm','cnt'], ascending=[True, False])
    return cnt.drop_duplicates('before_norm').set_index('before_norm')['after'].to_dict()

t0 = time.time()
print('Loading full train...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
print('Train shape:', train.shape)

gkf = GroupKFold(n_splits=5)
groups = train['sentence_id'].values
oof_pred = pd.Series(index=train.index, dtype=object)
fold_acc = []

for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    t_fold = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after']]
    mem = build_memorizer_from_df(tr)
    va_bn = train.loc[va_idx, 'before_norm'].values
    va_braw = train.loc[va_idx, 'before'].values

    preds = []
    for b_raw, b in zip(va_braw, va_bn):
        out = None
        if isinstance(b, str):
            if int_re.match(b):
                out = to_int_cardinal_ru(b)
            elif dec_re.match(b):
                out = to_decimal_ru(b)
        if out is None:
            out = mem.get(b, None)
            if out == '<self>' or out is None:
                out = b_raw
        preds.append(out)
    preds = pd.Series(preds, index=va_idx)
    oof_pred.loc[va_idx] = preds
    acc = (preds.values == train.loc[va_idx, 'after'].values).mean()
    fold_acc.append(acc)
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {(time.time()-t_fold):.1f}s')

overall = (oof_pred.values == train['after'].values).mean()
print(f'Enhanced OOF overall accuracy: {overall:.6f}')
print('Per-class OOF (top 15):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(15))
print('Done. Total elapsed:', f'{time.time()-t0:.1f}s')

Loading full train...


Train shape: (9515325, 6)


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.966581 | elapsed 14.5s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.966732 | elapsed 14.3s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.966559 | elapsed 14.1s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.966855 | elapsed 14.2s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.966559 | elapsed 14.2s
Enhanced OOF overall accuracy: 0.966657
Per-class OOF (top 15):


class
PUNCT         1.000000
VERBATIM      0.992550
PLAIN         0.984675
LETTERS       0.939792
DATE          0.773572
TIME          0.677219
CARDINAL      0.555602
ORDINAL       0.543798
MEASURE       0.534407
FRACTION      0.517162
ELECTRONIC    0.418627
TELEPHONE     0.195232
DECIMAL       0.173200
MONEY         0.148087
DIGIT         0.000000
dtype: float64
Done. Total elapsed: 93.0s


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


In [8]:
# OOF CV: correct backoff order (memorizer first, then generators) for INT/DEC
import pandas as pd, numpy as np, time
from sklearn.model_selection import GroupKFold

# Requires helpers from earlier cells: normalize_before, int_re, dec_re, to_int_cardinal_ru, to_decimal_ru

def build_memorizer_from_df(df: pd.DataFrame) -> dict:
    cnt = df.value_counts(['before_norm','after']).reset_index(name='cnt')
    cnt = cnt.sort_values(['before_norm','cnt'], ascending=[True, False])
    return cnt.drop_duplicates('before_norm').set_index('before_norm')['after'].to_dict()

t0 = time.time()
print('Loading full train...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
print('Train shape:', train.shape)

gkf = GroupKFold(n_splits=5)
groups = train['sentence_id'].values
oof_pred = pd.Series(index=train.index, dtype=object)
fold_acc = []

for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    t_fold = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after']]
    mem = build_memorizer_from_df(tr)
    va_bn = train.loc[va_idx, 'before_norm'].values
    va_braw = train.loc[va_idx, 'before'].values

    preds = []
    for b_raw, b in zip(va_braw, va_bn):
        # 1) Memorizer first
        out = mem.get(b, None)
        if out == '<self>':
            out = None
        # 2) Generators if no confident memorized mapping
        if out is None and isinstance(b, str):
            if int_re.match(b):
                out = to_int_cardinal_ru(b)
            elif dec_re.match(b):
                out = to_decimal_ru(b)
        # 3) Fallbacks
        if out is None:
            out = b_raw
        preds.append(out)
    preds = pd.Series(preds, index=va_idx)
    oof_pred.loc[va_idx] = preds
    acc = (preds.values == train.loc[va_idx, 'after'].values).mean()
    fold_acc.append(acc)
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {(time.time()-t_fold):.1f}s')

overall = (oof_pred.values == train['after'].values).mean()
print(f'Correct-backoff OOF overall accuracy: {overall:.6f}')
print('Per-class OOF (top 15):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(15))
print('Done. Total elapsed:', f'{time.time()-t0:.1f}s')

Loading full train...


Train shape: (9515325, 6)


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.973045 | elapsed 14.3s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.972953 | elapsed 12.9s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.972978 | elapsed 13.1s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.973220 | elapsed 12.8s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.972938 | elapsed 13.2s
Correct-backoff OOF overall accuracy: 0.973027
Per-class OOF (top 15):


class
PUNCT         1.000000
VERBATIM      0.992550
PLAIN         0.984675
LETTERS       0.939792
DIGIT         0.900336
CARDINAL      0.789590
DATE          0.773572
TIME          0.700396
ORDINAL       0.550440
MEASURE       0.534407
FRACTION      0.517162
ELECTRONIC    0.418627
DECIMAL       0.353692
TELEPHONE     0.197771
MONEY         0.148087
dtype: float64
Done. Total elapsed: 88.9s


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


In [9]:
# Inference v2: memorizer-first + regex router for DATE/TIME/DECIMAL + CARDINAL
import re, time, unicodedata, subprocess, sys
import pandas as pd
from collections import defaultdict, Counter

# Ensure num2words installed
try:
    from num2words import num2words as n2w
except Exception:
    print('Installing num2words...')
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'num2words'], check=True)
    from num2words import num2words as n2w

def normalize_before(x: str) -> str:
    if not isinstance(x, str):
        return x
    x = unicodedata.normalize('NFKC', x)
    x = x.replace('\u00A0', ' ').replace('\u2009', ' ')
    return x

digit_words = ['ноль','один','два','три','четыре','пять','шесть','семь','восемь','девять']

def plural_form(n, one, two_four, many):
    try:
        n = abs(int(n))
    except Exception:
        return many
    if 11 <= n % 100 <= 14:
        return many
    if n % 10 == 1:
        return one
    if n % 10 in (2,3,4):
        return two_four
    return many

# Regexes
int_re = re.compile(r'^-?\d{1,3}(?:[ \u00A0]?\d{3})*$|^-?\d+$')
dec_re = re.compile(r'^-?\d+[\.,]\d+$')
date_ddmmyyyy_re = re.compile(r'^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(\d{4})$')
date_d_month_yyyy_re = re.compile(r'^([1-9]|[12]\d|3[01])\s+(января|февраля|марта|апреля|мая|июня|июля|августа|сентября|октября|ноября|декабря)\s+(\d{4})$', re.I)
date_year_word_re = re.compile(r'^(\d{4})\s*(г\.|год)$', re.I)
time_hhmm_re = re.compile(r'^(?:[01]?\d|2[0-3]):[0-5]\d$')
ip_or_version_re = re.compile(r'^\d+(?:\.\d+){2,}$')
has_letter_re = re.compile(r'[A-Za-zА-Яа-я]')
url_email_re = re.compile(r'^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$|^https?://', re.I)

months_gen = ['января','февраля','марта','апреля','мая','июня','июля','августа','сентября','октября','ноября','декабря']
day_ord_gen = {
 1:'первого',2:'второго',3:'третьего',4:'четвертого',5:'пятого',6:'шестого',7:'седьмого',8:'восьмого',9:'девятого',10:'десятого',
 11:'одиннадцатого',12:'двенадцатого',13:'тринадцатого',14:'четырнадцатого',15:'пятнадцатого',16:'шестнадцатого',17:'семнадцатого',18:'восемнадцатого',19:'девятнадцатого',
 20:'двадцатого',21:'двадцать первого',22:'двадцать второго',23:'двадцать третьего',24:'двадцать четвертого',25:'двадцать пятого',26:'двадцать шестого',27:'двадцать седьмого',28:'двадцать восьмого',29:'двадцать девятого',
 30:'тридцатого',31:'тридцать первого'
}

def to_int_cardinal_ru(s: str) -> str | None:
    try:
        neg = s.strip().startswith('-')
        cleaned = s.replace(' ', '').replace('\u00A0','').lstrip('+')
        cleaned = cleaned[1:] if cleaned.startswith('-') else cleaned
        num = int(cleaned)
        spoken = n2w(num, lang='ru')
        if neg:
            spoken = 'минус ' + spoken
        return spoken
    except Exception:
        return None

def to_decimal_ru(s: str) -> str | None:
    try:
        if url_email_re.search(s) or ip_or_version_re.match(s) or has_letter_re.search(s):
            return None
        # Avoid interpreting dd.mm.yyyy as decimal
        if date_ddmmyyyy_re.match(s):
            return None
        neg = s.strip().startswith('-')
        t = s.strip().lstrip('+')
        t = t[1:] if t.startswith('-') else t
        if ',' in t:
            ip, fp = t.split(',', 1)
        else:
            ip, fp = t.split('.', 1)
        ip_spoken = to_int_cardinal_ru(ip) if ip else 'ноль'
        if ip_spoken is None:
            return None
        fp_digits = [digit_words[int(ch)] if ch.isdigit() else ch for ch in fp if ch.isdigit()]
        if not fp_digits:
            return None
        frac_spoken = ' '.join(fp_digits)
        out = f"{ip_spoken} запятая {frac_spoken}"
        if neg:
            out = 'минус ' + out if not out.startswith('минус') else out
        return out
    except Exception:
        return None

def masc_nom_to_gen(word: str) -> str:
    # crude but effective for year ordinals: replace last adjective ending
    if word.endswith('ый') or word.endswith('ий') or word.endswith('ой'):
        return word[:-2] + 'ого'
    return word

def ordinal_phrase_nom_to_gen(phrase: str) -> str:
    # convert last token's ending to genitive masculine
    parts = phrase.split()
    if not parts:
        return phrase
    parts[-1] = masc_nom_to_gen(parts[-1])
    return ' '.join(parts)

def year_to_ord_gen(y: int) -> str | None:
    try:
        if 1000 <= y <= 1999:
            rest = y - 1000
            if rest == 0:
                base = 'тысяча'
                # 'тысячного' uncommon; prefer memorizer if exists; return None
                return None
            rest_ord = n2w(rest, lang='ru', to='ordinal')
            rest_ord_gen = ordinal_phrase_nom_to_gen(rest_ord)
            return f'тысяча {rest_ord_gen}'
        if y == 2000:
            return 'двухтысячного'
        if 2001 <= y <= 2099:
            rest = y - 2000
            rest_ord = n2w(rest, lang='ru', to='ordinal')
            rest_ord_gen = ordinal_phrase_nom_to_gen(rest_ord)
            return f'две тысячи {rest_ord_gen}'
        # outside safe zone: skip to memorizer
        return None
    except Exception:
        return None

def to_date_ru(s: str) -> str | None:
    m = date_ddmmyyyy_re.match(s)
    if m:
        d = int(m.group(1))
        mo = int(m.group(2))
        y = int(m.group(3))
        d_sp = day_ord_gen.get(d)
        if d_sp is None:
            return None
        month_sp = months_gen[mo-1]
        y_sp = year_to_ord_gen(y)
        if y_sp is None:
            return None
        return f"{d_sp} {month_sp} {y_sp} года"
    m = date_d_month_yyyy_re.match(s)
    if m:
        d = int(m.group(1))
        month_word = m.group(2).lower()
        y = int(m.group(3))
        d_sp = day_ord_gen.get(d)
        if d_sp is None or month_word not in months_gen:
            return None
        y_sp = year_to_ord_gen(y)
        if y_sp is None:
            return None
        return f"{d_sp} {month_word} {y_sp} года"
    m = date_year_word_re.match(s)
    if m:
        y = int(m.group(1))
        y_sp = year_to_ord_gen(y)
        if y_sp is None:
            return None
        # Prefer 'года' per train style
        return f"{y_sp} года"
    return None

def to_time_ru(s: str) -> str | None:
    m = time_hhmm_re.match(s)
    if not m:
        return None
    try:
        hh, mm = s.split(':')
        h = int(hh)
        m_ = int(mm)
        h_sp = n2w(h, lang='ru')
        h_unit = plural_form(h, 'час', 'часа', 'часов')
        if m_ == 0:
            m_sp = 'ноль'
            m_unit = 'минут'
        else:
            m_sp = n2w(m_, lang='ru')
            m_unit = plural_form(m_, 'минута', 'минуты', 'минут')
        return f"{h_sp} {h_unit} {m_sp} {m_unit}"
    except Exception:
        return None

def build_normalized_memorizer(train_path: str, chunksize: int = 500_000):
    counts = defaultdict(Counter)
    rows = 0
    chunk_idx = 0
    t0 = time.time()
    for chunk in pd.read_csv(train_path, compression='zip', usecols=['before','after'], chunksize=chunksize):
        chunk_idx += 1
        bnorm = chunk['before'].map(normalize_before).values
        after = chunk['after'].values
        for b, a in zip(bnorm, after):
            counts[b][a] += 1
        rows += len(chunk)
        if chunk_idx % 5 == 0:
            print(f'  memorizer processed {rows:,} rows (chunks {chunk_idx}), unique keys: {len(counts):,}')
    mem = {b: ctr.most_common(1)[0][0] for b, ctr in counts.items()}
    print(f'  memorizer size: {len(mem):,}. build {time.time()-t0:.1f}s')
    return mem

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

print('Building normalized memorizer (for first backoff) ...')
memorized_norm = build_normalized_memorizer(train_path)

print('Loading test ...')
test = pd.read_csv(test_path, compression='zip')
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
bn = test['before'].map(normalize_before)

pred = []
t0 = time.time()
for i, (b_raw, b) in enumerate(zip(test['before'].values, bn.values)):
    out = None
    # 0) memorizer first
    if isinstance(b, str):
        out = memorized_norm.get(b, None)
        if out == '<self>':
            out = None
    # 1) DATE
    if out is None and isinstance(b, str):
        out = to_date_ru(b)
    # 2) TIME
    if out is None and isinstance(b, str):
        out = to_time_ru(b)
    # 3) DECIMAL with guards
    if out is None and isinstance(b, str):
        if dec_re.match(b) and not ip_or_version_re.match(b) and not has_letter_re.search(b) and not url_email_re.search(b):
            out = to_decimal_ru(b)
    # 4) CARDINAL
    if out is None and isinstance(b, str) and int_re.match(b) and not url_email_re.search(b):
        out = to_int_cardinal_ru(b)
    # 5) fallback identity if memorizer had nothing or returned None
    if out is None:
        # try memorizer again (if it wasn't <self>)
        if isinstance(b, str):
            tmp = memorized_norm.get(b, None)
            if tmp is not None and tmp != '<self>':
                out = tmp
    if out is None:
        out = b_raw
    pred.append(out)
    if (i+1) % 200000 == 0:
        print(f'  processed {i+1:,}/{len(test)} tokens...')

submission = pd.DataFrame({'id': test['id'], 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Elapsed:', f'{time.time()-t0:.1f}s')
print(submission.head())

Building normalized memorizer (for first backoff) ...


  memorizer processed 2,500,000 rows (chunks 5), unique keys: 324,280


  memorizer processed 5,000,000 rows (chunks 10), unique keys: 503,997


  memorizer processed 7,500,000 rows (chunks 15), unique keys: 648,998


  memorizer processed 9,515,325 rows (chunks 20), unique keys: 751,509


  memorizer size: 751,509. build 21.3s
Loading test ...


  processed 200,000/1059191 tokens...
  processed 400,000/1059191 tokens...


  processed 600,000/1059191 tokens...
  processed 800,000/1059191 tokens...


  processed 1,000,000/1059191 tokens...


Saved submission.csv with 1059191 rows. Elapsed: 2.2s
    id        after
0  0_0       Теперь
1  0_1          все
2  0_2  уважительно
3  0_3        зовут
4  0_4      Ямамото


In [10]:
# OOF CV for v2 router: memorizer-first + DATE/TIME/DECIMAL/CARDINAL
import pandas as pd, numpy as np, time
from sklearn.model_selection import GroupKFold
from collections import defaultdict, Counter

# Reuse helpers from cell 8: normalize_before, regexes, and converters

def build_memorizer_from_df(df: pd.DataFrame) -> dict:
    cnt = df.value_counts(['before_norm','after']).reset_index(name='cnt')
    cnt = cnt.sort_values(['before_norm','cnt'], ascending=[True, False])
    return cnt.drop_duplicates('before_norm').set_index('before_norm')['after'].to_dict()

t0 = time.time()
print('Loading full train...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
print('Train shape:', train.shape)

gkf = GroupKFold(n_splits=5)
groups = train['sentence_id'].values
oof_pred = pd.Series(index=train.index, dtype=object)
fold_acc = []

for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    t_fold = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after']]
    mem = build_memorizer_from_df(tr)
    va_bn = train.loc[va_idx, 'before_norm'].values
    va_braw = train.loc[va_idx, 'before'].values

    preds = []
    for b_raw, b in zip(va_braw, va_bn):
        out = None
        # 0) memorizer first
        if isinstance(b, str):
            out = mem.get(b, None)
            if out == '<self>':
                out = None
        # 1) DATE
        if out is None and isinstance(b, str):
            out = to_date_ru(b)
        # 2) TIME
        if out is None and isinstance(b, str):
            out = to_time_ru(b)
        # 3) DECIMAL (with guards, same as inference)
        if out is None and isinstance(b, str):
            if dec_re.match(b) and not ip_or_version_re.match(b) and not has_letter_re.search(b) and not url_email_re.search(b):
                out = to_decimal_ru(b)
        # 4) CARDINAL
        if out is None and isinstance(b, str) and int_re.match(b) and not url_email_re.search(b):
            out = to_int_cardinal_ru(b)
        # 5) fallback to memorizer if exists and not <self>, else original
        if out is None and isinstance(b, str):
            tmp = mem.get(b, None)
            if tmp is not None and tmp != '<self>':
                out = tmp
        if out is None:
            out = b_raw
        preds.append(out)
    preds = pd.Series(preds, index=va_idx)
    oof_pred.loc[va_idx] = preds
    acc = (preds.values == train.loc[va_idx, 'after'].values).mean()
    fold_acc.append(acc)
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {(time.time()-t_fold):.1f}s')

overall = (oof_pred.values == train['after'].values).mean()
print(f'v2 router OOF overall accuracy: {overall:.6f}')
print('Per-class OOF (top 20):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(20))
print('Done. Total elapsed:', f'{time.time()-t0:.1f}s')

Loading full train...


Train shape: (9515325, 6)


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.973343 | elapsed 14.1s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.973258 | elapsed 13.4s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.973276 | elapsed 13.1s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.973546 | elapsed 13.1s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.973244 | elapsed 13.2s
v2 router OOF overall accuracy: 0.973333
Per-class OOF (top 20):


class
PUNCT         1.000000
VERBATIM      0.992550
PLAIN         0.984675
LETTERS       0.939792
DIGIT         0.900336
TIME          0.872244
CARDINAL      0.789590
DATE          0.789204
ORDINAL       0.550440
MEASURE       0.534407
FRACTION      0.517162
ELECTRONIC    0.418627
DECIMAL       0.353692
TELEPHONE     0.197771
MONEY         0.148087
dtype: float64
Done. Total elapsed: 89.1s


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


In [14]:
# Inference v3: add MONEY and MEASURE handlers; memorizer-first routing
import re, time, unicodedata, subprocess, sys
import pandas as pd
from collections import defaultdict, Counter

# Reuse helpers from previous cells (installed already)
from num2words import num2words as n2w

def normalize_before(x: str) -> str:
    if not isinstance(x, str):
        return x
    x = unicodedata.normalize('NFKC', x)
    x = x.replace('\u00A0', ' ').replace('\u2009', ' ')
    return x

digit_words = ['\u043d\u043e\u043b\u044c','\u043e\u0434\u0438\u043d','\u0434\u0432\u0430','\u0442\u0440\u0438','\u0447\u0435\u0442\u044b\u0440\u0435','\u043f\u044f\u0442\u044c','\u0448\u0435\u0441\u0442\u044c','\u0441\u0435\u043c\u044c','\u0432\u043e\u0441\u0435\u043c\u044c','\u0434\u0435\u0432\u044f\u0442\u044c']

def plural_form(n, one, two_four, many):
    try:
        n = abs(int(float(n)))
    except Exception:
        return many
    if 11 <= n % 100 <= 14:
        return many
    if n % 10 == 1:
        return one
    if n % 10 in (2,3,4):
        return two_four
    return many

# Regexes (reuse from v2 where applicable)
int_re = re.compile(r'^-?\d{1,3}(?:[ \u00A0]?\d{3})*$|^-?\d+$')
dec_re = re.compile(r'^-?\d+[\.,]\d+$')
date_ddmmyyyy_re = re.compile(r'^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(\d{4})$')
date_d_month_yyyy_re = re.compile(r'^([1-9]|[12]\d|3[01])\s+(\u044f\u043d\u0432\u0430\u0440\u044f|\u0444\u0435\u0432\u0440\u0430\u043b\u044f|\u043c\u0430\u0440\u0442\u0430|\u0430\u043f\u0440\u0435\u043b\u044f|\u043c\u0430\u044f|\u0438\u044e\u043d\u044f|\u0438\u044e\u043b\u044f|\u0430\u0432\u0433\u0443\u0441\u0442\u0430|\u0441\u0435\u043d\u0442\u044f\u0431\u0440\u044f|\u043e\u043a\u0442\u044f\u0431\u0440\u044f|\u043d\u043e\u044f\u0431\u0440\u044f|\u0434\u0435\u043a\u0430\u0431\u0440\u044f)\s+(\d{4})$', re.I)
date_year_word_re = re.compile(r'^(\d{4})\s*(\u0433\.|\u0433\u043e\u0434)$', re.I)
time_hhmm_re = re.compile(r'^(?:[01]?\d|2[0-3]):[0-5]\d$')
ip_or_version_re = re.compile(r'^\d+(?:\.\d+){2,}$')
has_letter_re = re.compile(r'[A-Za-z\u0410-\u042f\u0430-\u044f]')
url_email_re = re.compile(r'^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$|^https?://', re.I)
# MONEY and MEASURE
money_re = re.compile(r'^-?\d+(?:[ \u00A0]?\d{3})*(?:[\.,]\d{1,2})?\s*(₽|\u20BD|руб\.?|р\.?|RUB|\$|USD|€|EUR)$', re.I)
measure_re = re.compile(r'^-?\d+(?:[ \u00A0]?\d{3})*(?:[\.,]\d+)?\s*(км|кг|м|%|°C|°С)$', re.I)

months_gen = ['\u044f\u043d\u0432\u0430\u0440\u044f','\u0444\u0435\u0432\u0440\u0430\u043b\u044f','\u043c\u0430\u0440\u0442\u0430','\u0430\u043f\u0440\u0435\u043b\u044f','\u043c\u0430\u044f','\u0438\u044e\u043d\u044f','\u0438\u044e\u043b\u044f','\u0430\u0432\u0433\u0443\u0441\u0442\u0430','\u0441\u0435\u043d\u0442\u044f\u0431\u0440\u044f','\u043e\u043a\u0442\u044f\u0431\u0440\u044f','\u043d\u043e\u044f\u0431\u0440\u044f','\u0434\u0435\u043a\u0430\u0431\u0440\u044f']
day_ord_gen = {1:'\u043f\u0435\u0440\u0432\u043e\u0433\u043e',2:'\u0432\u0442\u043e\u0440\u043e\u0433\u043e',3:'\u0442\u0440\u0435\u0442\u044c\u0435\u0433\u043e',4:'\u0447\u0435\u0442\u0432\u0435\u0440\u0442\u043e\u0433\u043e',5:'\u043f\u044f\u0442\u043e\u0433\u043e',6:'\u0448\u0435\u0441\u0442\u043e\u0433\u043e',7:'\u0441\u0435\u0434\u044c\u043c\u043e\u0433\u043e',8:'\u0432\u043e\u0441\u044c\u043c\u043e\u0433\u043e',9:'\u0434\u0435\u0432\u044f\u0442\u043e\u0433\u043e',10:'\u0434\u0435\u0441\u044f\u0442\u043e\u0433\u043e',11:'\u043e\u0434\u0438\u043d\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',12:'\u0434\u0432\u0435\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',13:'\u0442\u0440\u0438\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',14:'\u0447\u0435\u0442\u044b\u0440\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',15:'\u043f\u044f\u0442\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',16:'\u0448\u0435\u0441\u0442\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',17:'\u0441\u0435\u043c\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',18:'\u0432\u043e\u0441\u0435\u043c\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',19:'\u0434\u0435\u0432\u044f\u0442\u043d\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',20:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u043e\u0433\u043e',21:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u043f\u0435\u0440\u0432\u043e\u0433\u043e',22:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0432\u0442\u043e\u0440\u043e\u0433\u043e',23:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0442\u0440\u0435\u0442\u044c\u0435\u0433\u043e',24:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0447\u0435\u0442\u0432\u0435\u0440\u0442\u043e\u0433\u043e',25:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u043f\u044f\u0442\u043e\u0433\u043e',26:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0448\u0435\u0441\u0442\u043e\u0433\u043e',27:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0441\u0435\u0434\u044c\u043c\u043e\u0433\u043e',28:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0432\u043e\u0441\u044c\u043c\u043e\u0433\u043e',29:'\u0434\u0432\u0430\u0434\u0446\u0430\u0442\u044c \u0434\u0435\u0432\u044f\u0442\u043e\u0433\u043e',30:'\u0442\u0440\u0438\u0434\u0446\u0430\u0442\u043e\u0433\u043e',31:'\u0442\u0440\u0438\u0434\u0446\u0430\u0442\u044c \u043f\u0435\u0440\u0432\u043e\u0433\u043e'}

def to_int_cardinal_ru(s: str) -> str | None:
    try:
        neg = s.strip().startswith('-')
        cleaned = s.replace(' ', '').replace('\u00A0','').lstrip('+')
        cleaned = cleaned[1:] if cleaned.startswith('-') else cleaned
        num = int(cleaned)
        spoken = n2w(num, lang='ru')
        if neg:
            spoken = '\u043c\u0438\u043d\u0443\u0441 ' + spoken
        return spoken
    except Exception:
        return None

def to_decimal_ru(s: str) -> str | None:
    try:
        if url_email_re.search(s) or ip_or_version_re.match(s) or has_letter_re.search(s):
            return None
        if date_ddmmyyyy_re.match(s):
            return None
        neg = s.strip().startswith('-')
        t = s.strip().lstrip('+')
        t = t[1:] if t.startswith('-') else t
        if ',' in t:
            ip, fp = t.split(',', 1)
        else:
            ip, fp = t.split('.', 1)
        ip_spoken = to_int_cardinal_ru(ip) if ip else '\u043d\u043e\u043b\u044c'
        if ip_spoken is None:
            return None
        fp_digits = [digit_words[int(ch)] for ch in fp if ch.isdigit()]
        if not fp_digits:
            return None
        frac_spoken = ' '.join(fp_digits)
        out = f"{ip_spoken} \u0437\u0430\u043f\u044f\u0442\u0430\u044f {frac_spoken}"
        if neg:
            out = '\u043c\u0438\u043d\u0443\u0441 ' + out if not out.startswith('\u043c\u0438\u043d\u0443\u0441') else out
        return out
    except Exception:
        return None

def masc_nom_to_gen(word: str) -> str:
    if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
        return word[:-2] + '\u043e\u0433\u043e'
    return word

def ordinal_phrase_nom_to_gen(phrase: str) -> str:
    parts = phrase.split()
    if not parts:
        return phrase
    parts[-1] = masc_nom_to_gen(parts[-1])
    return ' '.join(parts)

def year_to_ord_gen(y: int) -> str | None:
    try:
        if 1000 <= y <= 1999:
            rest = y - 1000
            if rest == 0:
                return None
            rest_ord = n2w(rest, lang='ru', to='ordinal')
            rest_ord_gen = ordinal_phrase_nom_to_gen(rest_ord)
            return f'\u0442\u044b\u0441\u044f\u0447\u0430 {rest_ord_gen}'
        if y == 2000:
            return '\u0434\u0432\u0443\u0445\u0442\u044b\u0441\u044f\u0447\u043d\u043e\u0433\u043e'
        if 2001 <= y <= 2099:
            rest = y - 2000
            rest_ord = n2w(rest, lang='ru', to='ordinal')
            rest_ord_gen = ordinal_phrase_nom_to_gen(rest_ord)
            return f'\u0434\u0432\u0435 \u0442\u044b\u0441\u044f\u0447\u0438 {rest_ord_gen}'
        return None
    except Exception:
        return None

def to_date_ru(s: str) -> str | None:
    m = date_ddmmyyyy_re.match(s)
    if m:
        d = int(m.group(1)); mo = int(m.group(2)); y = int(m.group(3))
        d_sp = day_ord_gen.get(d); month_sp = months_gen[mo-1]; y_sp = year_to_ord_gen(y)
        if d_sp and y_sp:
            return f"{d_sp} {month_sp} {y_sp} \u0433\u043e\u0434\u0430"
        return None
    m = date_d_month_yyyy_re.match(s)
    if m:
        d = int(m.group(1)); month_word = m.group(2).lower(); y = int(m.group(3))
        d_sp = day_ord_gen.get(d); y_sp = year_to_ord_gen(y)
        if d_sp and y_sp and month_word in months_gen:
            return f"{d_sp} {month_word} {y_sp} \u0433\u043e\u0434\u0430"
        return None
    m = date_year_word_re.match(s)
    if m:
        y = int(m.group(1)); suf = m.group(2).lower(); y_sp = year_to_ord_gen(y)
        if y_sp:
            # If original contains explicit 'год', use nominative 'год'; for 'г.' prefer 'год' as well to match train example style
            if 'год' in suf:
                return f"{y_sp} \u0433\u043e\u0434"
            else:
                return f"{y_sp} \u0433\u043e\u0434"
    return None

def to_time_ru(s: str) -> str | None:
    m = time_hhmm_re.match(s)
    if not m:
        return None
    try:
        hh, mm = s.split(':')
        h = int(hh); m_ = int(mm)
        h_sp = n2w(h, lang='ru')
        h_unit = plural_form(h, '\u0447\u0430\u0441', '\u0447\u0430\u0441\u0430', '\u0447\u0430\u0441\u043e\u0432')
        if m_ == 0:
            m_sp = '\u043d\u043e\u043b\u044c'; m_unit = '\u043c\u0438\u043d\u0443\u0442'
        else:
            m_sp = n2w(m_, lang='ru'); m_unit = plural_form(m_, '\u043c\u0438\u043d\u0443\u0442\u0430', '\u043c\u0438\u043d\u0443\u0442\u044b', '\u043c\u0438\u043d\u0443\u0442')
        return f"{h_sp} {h_unit} {m_sp} {m_unit}"
    except Exception:
        return None

def to_money_ru(s: str) -> str | None:
    m = money_re.match(s)
    if not m:
        return None
    try:
        cur = m.group(1).upper() if m.group(1) else ''
        # extract numeric part (everything before currency)
        num_part = s[:m.start(1)].strip() if m.lastindex else s.strip()
        t = num_part.replace('\u00A0',' ').replace(' ', '').replace('+','')
        neg = t.startswith('-'); t = t[1:] if neg else t
        if ',' in t: ip, fp = t.split(',', 1)
        elif '.' in t: ip, fp = t.split('.', 1)
        else: ip, fp = t, ''
        r = int(ip) if ip else 0
        r_sp = n2w(r, lang='ru') if r != 0 else '\u043d\u043e\u043b\u044c'
        if cur in {'₽','\u20BD','РУБ.','РУБ','Р.','RUB'}:
            rub_unit = plural_form(r, '\u0440\u0443\u0431\u043b\u044c','\u0440\u0443\u0431\u043b\u044f','\u0440\u0443\u0431\u043b\u0435\u0439')
            out = f"{r_sp} {rub_unit}"
            k = int(fp[:2]) if fp else 0
            if fp and len(fp) == 1:
                k *= 10
            if k > 0:
                k_sp = n2w(k, lang='ru')
                kop_unit = plural_form(k, '\u043a\u043e\u043f\u0435\u0439\u043a\u0430','\u043a\u043e\u043f\u0435\u0439\u043a\u0438','\u043a\u043e\u043f\u0435\u0435\u043a')
                out = f"{out} {k_sp} {kop_unit}"
            if neg:
                out = '\u043c\u0438\u043d\u0443\u0441 ' + out
            return out
        if cur in {'$','USD'}:
            unit = ('\u0434\u043e\u043b\u043b\u0430\u0440','\u0434\u043e\u043b\u043b\u0430\u0440\u0430','\u0434\u043e\u043b\u043b\u0430\u0440\u043e\u0432')
            out = f"{r_sp} {plural_form(r,*unit)}"
            c = int(fp[:2]) if fp else 0
            if fp and len(fp) == 1:
                c *= 10
            if c > 0:
                c_sp = n2w(c, lang='ru')
                cent_unit = plural_form(c, '\u0446\u0435\u043d\u0442','\u0446\u0435\u043d\u0442\u0430','\u0446\u0435\u043d\u0442\u043e\u0432')
                out = f"{out} {c_sp} {cent_unit}"
            if neg: out = '\u043c\u0438\u043d\u0443\u0441 ' + out
            return out
        if cur in {'€','EUR'}:
            # евро invariable; cents as above
            out = f"{r_sp} \u0435\u0432\u0440\u043e"
            c = int(fp[:2]) if fp else 0
            if fp and len(fp) == 1: c *= 10
            if c > 0:
                c_sp = n2w(c, lang='ru')
                cent_unit = plural_form(c, '\u0446\u0435\u043d\u0442','\u0446\u0435\u043d\u0442\u0430','\u0446\u0435\u043d\u0442\u043e\u0432')
                out = f"{out} {c_sp} {cent_unit}"
            if neg: out = '\u043c\u0438\u043d\u0443\u0441 ' + out
            return out
        return None
    except Exception:
        return None

unit_forms = {
    '\u043a\u043c': ('\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440','\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u0430','\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043a\u0433': ('\u043a\u0438\u043b\u043e\u0433\u0440\u0430\u043c\u043c','\u043a\u0438\u043b\u043e\u0433\u0440\u0430\u043c\u043c\u0430','\u043a\u0438\u043b\u043e\u0433\u0440\u0430\u043c\u043c\u043e\u0432'),
    '\u043c': ('\u043c\u0435\u0442\u0440','\u043c\u0435\u0442\u0440\u0430','\u043c\u0435\u0442\u0440\u043e\u0432'),
    '%': ('\u043f\u0440\u043e\u0446\u0435\u043d\u0442','\u043f\u0440\u043e\u0446\u0435\u043d\u0442\u0430','\u043f\u0440\u043e\u0446\u0435\u043d\u0442\u043e\u0432'),
    '\u00b0C': ('\u0433\u0440\u0430\u0434\u0443\u0441 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u0430 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u043e\u0432 \u0426\u0435\u043b\u044c\u0441\u0438\u044f'),
    '\u00b0\u0421': ('\u0433\u0440\u0430\u0434\u0443\u0441 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u0430 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u043e\u0432 \u0426\u0435\u043b\u044c\u0441\u0438\u044f')
}

def to_measure_ru(s: str) -> str | None:
    m = measure_re.match(s)
    if not m:
        return None
    try:
        unit_raw = m.group(1)
        # extract numeric part
        num_part = s[:m.start(1)].strip()
        t = num_part.replace('\u00A0',' ').replace(' ', '').replace('+','')
        neg = t.startswith('-'); t = t[1:] if neg else t
        is_decimal = (',' in t) or ('.' in t)
        if ',' in t: ip, fp = t.split(',', 1)
        elif '.' in t: ip, fp = t.split('.', 1)
        else: ip, fp = t, ''
        val_int = int(ip) if ip else 0
        num_sp = n2w(val_int, lang='ru') if not is_decimal else n2w(int(ip or '0'), lang='ru')
        unit_key = unit_raw
        if unit_key not in unit_forms:
            return None
        forms = unit_forms[unit_key]
        # plural by integer part only
        unit_word = plural_form(val_int, forms[0], forms[1], forms[2])
        out = f"{num_sp} {unit_word}"
        if neg:
            out = '\u043c\u0438\u043d\u0443\u0441 ' + out
        return out
    except Exception:
        return None

def build_normalized_memorizer(train_path: str, chunksize: int = 500_000):
    counts = defaultdict(Counter)
    rows = 0; chunk_idx = 0; t0 = time.time()
    for chunk in pd.read_csv(train_path, compression='zip', usecols=['before','after'], chunksize=chunksize):
        chunk_idx += 1
        bnorm = chunk['before'].map(normalize_before).values
        after = chunk['after'].values
        for b, a in zip(bnorm, after):
            counts[b][a] += 1
        rows += len(chunk)
        if chunk_idx % 5 == 0:
            print(f'  memorizer processed {rows:,} rows (chunks {chunk_idx}), unique keys: {len(counts):,}')
    mem = {b: ctr.most_common(1)[0][0] for b, ctr in counts.items()}
    print(f'  memorizer size: {len(mem):,}. build {time.time()-t0:.1f}s')
    return mem

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

print('Building normalized memorizer (for first backoff) ...')
memorized_norm = build_normalized_memorizer(train_path)

print('Loading test ...')
test = pd.read_csv(test_path, compression='zip')
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
bn = test['before'].map(normalize_before)

pred = []
t0 = time.time()
for i, (b_raw, b) in enumerate(zip(test['before'].values, bn.values)):
    out = None
    # 0) memorizer first
    if isinstance(b, str):
        out = memorized_norm.get(b, None)
        if out == '<self>':
            out = None
    # 1) DATE
    if out is None and isinstance(b, str):
        out = to_date_ru(b)
    # 2) TIME
    if out is None and isinstance(b, str):
        out = to_time_ru(b)
    # 3) MONEY
    if out is None and isinstance(b, str):
        out = to_money_ru(b)
    # 4) DECIMAL (with guards)
    if out is None and isinstance(b, str):
        if dec_re.match(b) and not ip_or_version_re.match(b) and not has_letter_re.search(b) and not url_email_re.search(b):
            out = to_decimal_ru(b)
    # 5) CARDINAL
    if out is None and isinstance(b, str) and int_re.match(b) and not url_email_re.search(b):
        out = to_int_cardinal_ru(b)
    # 6) MEASURE
    if out is None and isinstance(b, str):
        out = to_measure_ru(b)
    # 7) fallback to memorizer (if exists and not <self>), else identity
    if out is None and isinstance(b, str):
        tmp = memorized_norm.get(b, None)
        if tmp is not None and tmp != '<self>':
            out = tmp
    if out is None:
        out = b_raw
    pred.append(out)
    if (i+1) % 200000 == 0:
        print(f'  processed {i+1:,}/{len(test)} tokens...')

submission = pd.DataFrame({'id': test['id'], 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Elapsed:', f'{time.time()-t0:.1f}s')
print(submission.head())

Building normalized memorizer (for first backoff) ...


  memorizer processed 2,500,000 rows (chunks 5), unique keys: 324,280


  memorizer processed 5,000,000 rows (chunks 10), unique keys: 503,997


  memorizer processed 7,500,000 rows (chunks 15), unique keys: 648,998


  memorizer processed 9,515,325 rows (chunks 20), unique keys: 751,509


  memorizer size: 751,509. build 21.9s
Loading test ...


  processed 200,000/1059191 tokens...


  processed 400,000/1059191 tokens...
  processed 600,000/1059191 tokens...


  processed 800,000/1059191 tokens...
  processed 1,000,000/1059191 tokens...


Saved submission.csv with 1059191 rows. Elapsed: 2.4s
    id        after
0  0_0       Теперь
1  0_1          все
2  0_2  уважительно
3  0_3        зовут
4  0_4      Ямамото


In [13]:
# OOF CV for v3 router: memorizer-first + DATE/TIME/MONEY/DECIMAL/CARDINAL/MEASURE
import pandas as pd, numpy as np, time
from sklearn.model_selection import GroupKFold

# Reuse helpers from cells 8/10: normalize_before, regexes, and converters: to_date_ru, to_time_ru, to_money_ru, to_decimal_ru, to_int_cardinal_ru, to_measure_ru

def build_memorizer_from_df(df: pd.DataFrame) -> dict:
    cnt = df.value_counts(['before_norm','after']).reset_index(name='cnt')
    cnt = cnt.sort_values(['before_norm','cnt'], ascending=[True, False])
    return cnt.drop_duplicates('before_norm').set_index('before_norm')['after'].to_dict()

t0 = time.time()
print('Loading full train...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
print('Train shape:', train.shape)

gkf = GroupKFold(n_splits=5)
groups = train['sentence_id'].values
oof_pred = pd.Series(index=train.index, dtype=object)
fold_acc = []

for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    t_fold = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after']]
    mem = build_memorizer_from_df(tr)
    va_bn = train.loc[va_idx, 'before_norm'].values
    va_braw = train.loc[va_idx, 'before'].values

    preds = []
    for b_raw, b in zip(va_braw, va_bn):
        out = None
        # 0) memorizer first
        if isinstance(b, str):
            out = mem.get(b, None)
            if out == '<self>':
                out = None
        # 1) DATE
        if out is None and isinstance(b, str):
            out = to_date_ru(b)
        # 2) TIME
        if out is None and isinstance(b, str):
            out = to_time_ru(b)
        # 3) MONEY
        if out is None and isinstance(b, str):
            out = to_money_ru(b)
        # 4) DECIMAL with guards
        if out is None and isinstance(b, str):
            if dec_re.match(b) and not ip_or_version_re.match(b) and not has_letter_re.search(b) and not url_email_re.search(b):
                out = to_decimal_ru(b)
        # 5) CARDINAL
        if out is None and isinstance(b, str) and int_re.match(b) and not url_email_re.search(b):
            out = to_int_cardinal_ru(b)
        # 6) MEASURE
        if out is None and isinstance(b, str):
            out = to_measure_ru(b)
        # 7) fallback to memorizer (if exists and not <self>), else original
        if out is None and isinstance(b, str):
            tmp = mem.get(b, None)
            if tmp is not None and tmp != '<self>':
                out = tmp
        if out is None:
            out = b_raw
        preds.append(out)
    preds = pd.Series(preds, index=va_idx)
    oof_pred.loc[va_idx] = preds
    acc = (preds.values == train.loc[va_idx, 'after'].values).mean()
    fold_acc.append(acc)
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {(time.time()-t_fold):.1f}s')

overall = (oof_pred.values == train['after'].values).mean()
print(f'v3 router OOF overall accuracy: {overall:.6f}')
print('Per-class OOF (top 20):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(20))
print('Done. Total elapsed:', f'{time.time()-t0:.1f}s')

Loading full train...


Train shape: (9515325, 6)


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.973381 | elapsed 14.6s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.973305 | elapsed 13.7s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.973310 | elapsed 13.4s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.973584 | elapsed 13.7s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.973287 | elapsed 13.5s
v3 router OOF overall accuracy: 0.973373
Per-class OOF (top 20):


class
PUNCT         1.000000
VERBATIM      0.992550
PLAIN         0.984675
LETTERS       0.939792
DIGIT         0.900336
TIME          0.872244
CARDINAL      0.789590
DATE          0.789204
ORDINAL       0.550440
MEASURE       0.543861
FRACTION      0.517162
ELECTRONIC    0.418627
DECIMAL       0.353692
TELEPHONE     0.197771
MONEY         0.163062
dtype: float64
Done. Total elapsed: 91.2s


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


In [19]:
# v4 router: add ORDINAL (safe hyphen trigger) + TELEPHONE gates; improved routing order + FRACTION + PERCENT
import re, time
import pandas as pd, numpy as np
from sklearn.model_selection import GroupKFold

# Reuse helpers from previous cells (normalization, num2words, date/time/money/decimal/measure)
from num2words import num2words as n2w

# Gates from earlier cells (reuse if defined); redefine here for isolation
has_letter_re = re.compile(r'[A-Za-z\u0410-\u042f\u0430-\u044f]')
url_email_re = re.compile(r'^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$|^https?://|^www\.', re.I)
ip_or_version_re = re.compile(r'^\d+(?:\.\d+){2,}$')
int_re = re.compile(r'^-?\d{1,3}(?:[ \u00A0]?\d{3})*$|^-?\d+$')
dec_re = re.compile(r'^-?\d+[\.,]\d+$')
date_ddmmyyyy_re = re.compile(r'^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(\d{4})$')
fraction_re = re.compile(r'^(\d+)/(\d+)$')
percent_re = re.compile(r'^-?\d+(?:[ \u00A0]?\d{3})*(?:[\.,]\d+)?\s*%$')

# TELEPHONE gates (conservative) before numeric generators
tel_re_generic = re.compile(r'^\+?\d[\d\-\(\) \u00A0]{6,}\d$')
tel_re_ru_1 = re.compile(r'^\+7\d{10}$')
tel_re_ru_2 = re.compile(r'^8\d{10}$')

digit_words = ['\u043d\u043e\u043b\u044c','\u043e\u0434\u0438\u043d','\u0434\u0432\u0430','\u0442\u0440\u0438','\u0447\u0435\u0442\u044b\u0440\u0435','\u043f\u044f\u0442\u044c','\u0448\u0435\u0441\u0442\u044c','\u0441\u0435\u043c\u044c','\u0432\u043e\u0441\u0435\u043c\u044c','\u0434\u0435\u0432\u044f\u0442\u044c']
def speak_digits_ru(s: str) -> str:
    out = []
    lead_plus = s.strip().startswith('+')
    for ch in s:
        if ch.isdigit():
            out.append(digit_words[int(ch)])
    if lead_plus:
        return '\u043f\u043b\u044e\u0441 ' + ' '.join(out)
    return ' '.join(out)

# ORDINAL safe trigger: hyphen/dotted suffixes only
ordinal_re = re.compile(r'^(\d+)-?(\u0439|\u044b\u0439|\u0438\u0439|\u044f|\u0435|\u043e\u0435|\u0430\u044f|\u0433\u043e|\u043e\u0433\u043e|\u0435\u043c\u0443|\u043e\u043c\u0443|\u044b\u043c|\u0438\u043c|\u043e\u043c|\u043e\u0439|\u0443\u044e|\u043c\u0438|\u044b\u043c\u0438|\u044b\u0445|\u0438\u0445)\.?$', re.I)

def morph_ordinal_phrase(base_nom_masc: str, suffix: str) -> str:
    # Convert last token according to Russian case/gender implied by suffix
    parts = base_nom_masc.split()
    if not parts:
        return base_nom_masc
    last = parts[-1]
    suf = suffix.lower()
    def to_gen(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u043e\u0433\u043e'
        return word
    def to_prepdat(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u043e\u043c\u0443'
        return word
    def to_instr(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u044b\u043c'
        return word
    def to_fem_nom(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u0430\u044f'
        return word
    def to_neut_nom(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u043e\u0435'
        return word
    # Map suffix to target transformation
    if suf in ('\u0439','\u044b\u0439','\u0438\u0439'):
        parts[-1] = last
    elif suf in ('\u0433\u043e','\u043e\u0433\u043e','\u0435\u0433\u043e'):
        parts[-1] = to_gen(last)
    elif suf in ('\u043e\u043c\u0443','\u0435\u043c\u0443'):
        parts[-1] = to_prepdat(last)
    elif suf in ('\u044b\u043c','\u0438\u043c'):
        parts[-1] = to_instr(last)
    elif suf in ('\u044f','\u0430\u044f'):
        parts[-1] = to_fem_nom(last)
    elif suf in ('\u0435','\u043e\u0435','\u0435\u0435'):
        parts[-1] = to_neut_nom(last)
    elif suf in ('\u043e\u0439','\u0435\u0439'):
        parts[-1] = to_gen(last)  # gen/prep merged simplification
    elif suf == '\u0443\u044e':
        parts[-1] = to_fem_nom(last)  # accusative fem; approx
    # Plural/other endings: leave base (simplified, safe)
    return ' '.join(parts)

def to_ordinal_ru(token: str) -> str | None:
    m = ordinal_re.match(token)
    if not m:
        return None
    try:
        n = int(m.group(1))
        suf = m.group(2)
        base = n2w(n, lang='ru', to='ordinal')  # masculine nominative phrase
        phr = morph_ordinal_phrase(base, suf)
        return phr
    except Exception:
        return None

# FRACTION handler
def plural_form(n, one, two_four, many):
    try:
        n = abs(int(n))
    except Exception:
        return many
    if 11 <= n % 100 <= 14:
        return many
    if n % 10 == 1:
        return one
    if n % 10 in (2,3,4):
        return two_four
    return many

denom_nom_fem = {2:'\u0432\u0442\u043e\u0440\u0430\u044f',3:'\u0442\u0440\u0435\u0442\u044c\u044f',4:'\u0447\u0435\u0442\u0432\u0435\u0440\u0442\u0430\u044f',5:'\u043f\u044f\u0442\u0430\u044f',6:'\u0448\u0435\u0441\u0442\u0430\u044f',7:'\u0441\u0435\u0434\u044c\u043c\u0430\u044f',8:'\u0432\u043e\u0441\u044c\u043c\u0430\u044f',9:'\u0434\u0435\u0432\u044f\u0442\u0430\u044f',10:'\u0434\u0435\u0441\u044f\u0442\u0430\u044f'}
denom_gen_pl = {2:'\u0432\u0442\u043e\u0440\u044b\u0445',3:'\u0442\u0440\u0435\u0442\u044c\u0438\u0445',4:'\u0447\u0435\u0442\u0432\u0435\u0440\u0442\u044b\u0445',5:'\u043f\u044f\u0442\u044b\u0445',6:'\u0448\u0435\u0441\u0442\u044b\u0445',7:'\u0441\u0435\u0434\u044c\u043c\u044b\u0445',8:'\u0432\u043e\u0441\u044c\u043c\u044b\u0445',9:'\u0434\u0435\u0432\u044f\u0442\u044b\u0445',10:'\u0434\u0435\u0441\u044f\u0442\u044b\u0445'}

def to_fraction_ru(token: str) -> str | None:
    m = fraction_re.match(token)
    if not m:
        return None
    a = int(m.group(1)); b = int(m.group(2))
    if b == 0:
        return None
    try:
        if a == 1:
            if b in denom_nom_fem:
                num_sp = '\u043e\u0434\u043d\u0430'
                return f"{num_sp} {denom_nom_fem[b]}"
            # fallback generic
            return f"\u043e\u0434\u043d\u0430 {n2w(b, lang='ru', to='ordinal')}"
        # plural genitive pattern: '\u0434\u0432\u0435/\u0442\u0440\u0438/\u0447\u0435\u0442\u044b\u0440\u0435/\u043f\u044f\u0442\u044c ...' + denom gen pl
        a_sp = n2w(a, lang='ru')
        if a % 10 == 2 and a % 100 != 12:
            a_sp = '\u0434\u0432\u0435' if a == 2 else a_sp
        denom = denom_gen_pl.get(b, None)
        if denom is None:
            denom = n2w(b, lang='ru', to='ordinal')
            # crude genitive plural by replacing masculine endings
            if denom.endswith('\u044b\u0439') or denom.endswith('\u0438\u0439') or denom.endswith('\u043e\u0439'):
                denom = denom[:-2] + '\u044b\u0445'
        return f"{a_sp} {denom}"
    except Exception:
        return None

# PERCENT handler (before DECIMAL/MEASURE)
def to_percent_ru(token: str) -> str | None:
    if not percent_re.match(token):
        return None
    try:
        t = token.replace('\u00A0',' ').strip()
        t = t[:-1].strip()  # remove %
        neg = t.startswith('-')
        t = t.lstrip('+')
        if t.startswith('-'):
            t = t[1:]
        # split integer and fractional by comma/dot
        if ',' in t:
            ip, fp = t.split(',', 1)
        elif '.' in t:
            ip, fp = t.split('.', 1)
        else:
            ip, fp = t, ''
        # integer part spoken
        ip_sp = n2w(int(ip) if ip else 0, lang='ru')
        if fp:
            frac_digits = ' '.join(digit_words[int(ch)] for ch in fp if ch.isdigit())
            core = f"{ip_sp} \u0437\u0430\u043f\u044f\u0442\u0430\u044f {frac_digits}"
        else:
            core = ip_sp
        # plural for 'процент'
        try:
            ip_val = abs(int(ip)) if ip else 0
        except Exception:
            ip_val = 0
        unit = plural_form(ip_val, '\u043f\u0440\u043e\u0446\u0435\u043d\u0442', '\u043f\u0440\u043e\u0446\u0435\u043d\u0442\u0430', '\u043f\u0440\u043e\u0446\u0435\u043d\u0442\u043e\u0432')
        out = f"{core} {unit}"
        if neg:
            out = '\u043c\u0438\u043d\u0443\u0441 ' + out
        return out
    except Exception:
        return None

def build_memorizer_from_df(df: pd.DataFrame) -> dict:
    cnt = df.value_counts(['before_norm','after']).reset_index(name='cnt')
    cnt = cnt.sort_values(['before_norm','cnt'], ascending=[True, False])
    return cnt.drop_duplicates('before_norm').set_index('before_norm')['after'].to_dict()

print('Loading full train...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
print('Train shape:', train.shape)

gkf = GroupKFold(n_splits=5)
groups = train['sentence_id'].values
oof_pred = pd.Series(index=train.index, dtype=object)
fold_acc = []
t0 = time.time()
for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    tf = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after']]
    mem = build_memorizer_from_df(tr)
    va_bn = train.loc[va_idx, 'before_norm'].values
    va_braw = train.loc[va_idx, 'before'].values

    preds = []
    for b_raw, b in zip(va_braw, va_bn):
        out = None
        if isinstance(b, str):
            # 0) Memorizer first (skip '<self>')
            out = mem.get(b, None)
            if out == '<self>':
                out = None
            # 1) ELECTRONIC gate: if URL/email, leave to memorizer/identity; prevents numeric misfires
            if out is None and url_email_re.search(b):
                # try mem again
                tmp = mem.get(b, None)
                out = tmp if (tmp is not None and tmp != '<self>') else b_raw
            # 2) TELEPHONE
            if out is None and (tel_re_generic.match(b) or tel_re_ru_1.match(b) or tel_re_ru_2.match(b)):
                out = speak_digits_ru(b)
            # 3) DATE
            if out is None:
                out = to_date_ru(b)
            # 4) ORDINAL (safe)
            if out is None:
                out = to_ordinal_ru(b)
            # 5) TIME
            if out is None:
                out = to_time_ru(b)
            # 6) MONEY
            if out is None:
                out = to_money_ru(b)
            # 7) FRACTION
            if out is None:
                out = to_fraction_ru(b)
            # 8) PERCENT
            if out is None and percent_re.match(b):
                out = to_percent_ru(b)
            # 9) DECIMAL with guards
            if out is None and dec_re.match(b) and not ip_or_version_re.match(b) and not has_letter_re.search(b) and not url_email_re.search(b) and not date_ddmmyyyy_re.match(b):
                out = to_decimal_ru(b)
            # 10) MEASURE
            if out is None:
                out = to_measure_ru(b)
            # 11) CARDINAL
            if out is None and int_re.match(b) and not url_email_re.search(b):
                out = to_int_cardinal_ru(b)
            # 12) fallback memorizer, else identity
            if out is None:
                tmp = mem.get(b, None)
                if tmp is not None and tmp != '<self>':
                    out = tmp
        if out is None:
            out = b_raw
        preds.append(out)
    preds = pd.Series(preds, index=va_idx)
    oof_pred.loc[va_idx] = preds
    acc = (preds.values == train.loc[va_idx, 'after'].values).mean()
    fold_acc.append(acc)
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {time.time()-tf:.1f}s')

overall = (oof_pred.values == train['after'].values).mean()
print(f'v4 router OOF overall accuracy: {overall:.6f}')
print('Per-class OOF (top 20):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(20))
print('Done. Total elapsed:', f'{time.time()-t0:.1f}s')

Loading full train...


Train shape: (9515325, 6)


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.973432 | elapsed 15.6s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.973348 | elapsed 14.7s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.973371 | elapsed 14.4s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.973593 | elapsed 14.5s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.973323 | elapsed 14.5s
v4 router OOF overall accuracy: 0.973414
Per-class OOF (top 20):


class
PUNCT         1.000000
VERBATIM      0.992550
PLAIN         0.984675
LETTERS       0.939792
DIGIT         0.900336
TIME          0.872244
DATE          0.789049
CARDINAL      0.787920
FRACTION      0.631121
ORDINAL       0.563201
MEASURE       0.542981
ELECTRONIC    0.418627
DECIMAL       0.353692
TELEPHONE     0.204724
MONEY         0.163062
dtype: float64
Done. Total elapsed: 83.0s


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


In [20]:
# Inference v4: add ORDINAL + TELEPHONE gates; improved routing order; memorizer-first (+FRACTION + PERCENT)
import re, time, pandas as pd
from collections import defaultdict, Counter
from num2words import num2words as n2w

# Gates/regex (match v4 OOF cell)
has_letter_re = re.compile(r'[A-Za-z\u0410-\u042f\u0430-\u044f]')
url_email_re = re.compile(r'^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$|^https?://|^www\.', re.I)
ip_or_version_re = re.compile(r'^\d+(?:\.\d+){2,}$')
int_re = re.compile(r'^-?\d{1,3}(?:[ \u00A0]?\d{3})*$|^-?\d+$')
dec_re = re.compile(r'^-?\d+[\.,]\d+$')
date_ddmmyyyy_re = re.compile(r'^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(\d{4})$')
fraction_re = re.compile(r'^(\d+)/(\d+)$')
percent_re = re.compile(r'^-?\d+(?:[ \u00A0]?\d{3})*(?:[\.,]\d+)?\s*%$')

# TELEPHONE gates
tel_re_generic = re.compile(r'^\+?\d[\d\-\(\) \u00A0]{6,}\d$')
tel_re_ru_1 = re.compile(r'^\+7\d{10}$')
tel_re_ru_2 = re.compile(r'^8\d{10}$')

digit_words = ['\u043d\u043e\u043b\u044c','\u043e\u0434\u0438\u043d','\u0434\u0432\u0430','\u0442\u0440\u0438','\u0447\u0435\u0442\u044b\u0440\u0435','\u043f\u044f\u0442\u044c','\u0448\u0435\u0441\u0442\u044c','\u0441\u0435\u043c\u044c','\u0432\u043e\u0441\u0435\u043c\u044c','\u0434\u0435\u0432\u044f\u0442\u044c']
def speak_digits_ru(s: str) -> str:
    out = []
    lead_plus = s.strip().startswith('+')
    for ch in s:
        if ch.isdigit():
            out.append(digit_words[int(ch)])
    if lead_plus:
        return '\u043f\u043b\u044e\u0441 ' + ' '.join(out)
    return ' '.join(out)

# ORDINAL safe trigger
ordinal_re = re.compile(r'^(\d+)-?(\u0439|\u044b\u0439|\u0438\u0439|\u044f|\u0435|\u043e\u0435|\u0430\u044f|\u0433\u043e|\u043e\u0433\u043e|\u0435\u043c\u0443|\u043e\u043c\u0443|\u044b\u043c|\u0438\u043c|\u043e\u043c|\u043e\u0439|\u0443\u044e|\u043c\u0438|\u044b\u043c\u0438|\u044b\u0445|\u0438\u0445)\.?$', re.I)
def morph_ordinal_phrase(base_nom_masc: str, suffix: str) -> str:
    parts = base_nom_masc.split()
    if not parts:
        return base_nom_masc
    last = parts[-1]
    suf = suffix.lower()
    def to_gen(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u043e\u0433\u043e'
        return word
    def to_prepdat(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u043e\u043c\u0443'
        return word
    def to_instr(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u044b\u043c'
        return word
    def to_fem_nom(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u0430\u044f'
        return word
    def to_neut_nom(word):
        if word.endswith('\u044b\u0439') or word.endswith('\u0438\u0439') or word.endswith('\u043e\u0439'):
            return word[:-2] + '\u043e\u0435'
        return word
    if suf in ('\u0439','\u044b\u0439','\u0438\u0439'):
        parts[-1] = last
    elif suf in ('\u0433\u043e','\u043e\u0433\u043e','\u0435\u0433\u043e'):
        parts[-1] = to_gen(last)
    elif suf in ('\u043e\u043c\u0443','\u0435\u043c\u0443'):
        parts[-1] = to_prepdat(last)
    elif suf in ('\u044b\u043c','\u0438\u043c'):
        parts[-1] = to_instr(last)
    elif suf in ('\u044f','\u0430\u044f'):
        parts[-1] = to_fem_nom(last)
    elif suf in ('\u0435','\u043e\u0435','\u0435\u0435'):
        parts[-1] = to_neut_nom(last)
    elif suf in ('\u043e\u0439','\u0435\u0439'):
        parts[-1] = to_gen(last)
    elif suf == '\u0443\u044e':
        parts[-1] = to_fem_nom(last)
    return ' '.join(parts)

def to_ordinal_ru(token: str) -> str | None:
    m = ordinal_re.match(token)
    if not m:
        return None
    try:
        n = int(m.group(1)); suf = m.group(2)
        base = n2w(n, lang='ru', to='ordinal')
        return morph_ordinal_phrase(base, suf)
    except Exception:
        return None

# FRACTION and PERCENT helpers
def plural_form(n, one, two_four, many):
    try:
        n = abs(int(n))
    except Exception:
        return many
    if 11 <= n % 100 <= 14:
        return many
    if n % 10 == 1:
        return one
    if n % 10 in (2,3,4):
        return two_four
    return many

denom_nom_fem = {2:'\u0432\u0442\u043e\u0440\u0430\u044f',3:'\u0442\u0440\u0435\u0442\u044c\u044f',4:'\u0447\u0435\u0442\u0432\u0435\u0440\u0442\u0430\u044f',5:'\u043f\u044f\u0442\u0430\u044f',6:'\u0448\u0435\u0441\u0442\u0430\u044f',7:'\u0441\u0435\u0434\u044c\u043c\u0430\u044f',8:'\u0432\u043e\u0441\u044c\u043c\u0430\u044f',9:'\u0434\u0435\u0432\u044f\u0442\u0430\u044f',10:'\u0434\u0435\u0441\u044f\u0442\u0430\u044f'}
denom_gen_pl = {2:'\u0432\u0442\u043e\u0440\u044b\u0445',3:'\u0442\u0440\u0435\u0442\u044c\u0438\u0445',4:'\u0447\u0435\u0442\u0432\u0435\u0440\u0442\u044b\u0445',5:'\u043f\u044f\u0442\u044b\u0445',6:'\u0448\u0435\u0441\u0442\u044b\u0445',7:'\u0441\u0435\u0434\u044c\u043c\u044b\u0445',8:'\u0432\u043e\u0441\u044c\u043c\u044b\u0445',9:'\u0434\u0435\u0432\u044f\u0442\u044b\u0445',10:'\u0434\u0435\u0441\u044f\u0442\u044b\u0445'}

def to_fraction_ru(token: str) -> str | None:
    m = fraction_re.match(token)
    if not m:
        return None
    a = int(m.group(1)); b = int(m.group(2))
    if b == 0:
        return None
    try:
        if a == 1:
            if b in denom_nom_fem:
                num_sp = '\u043e\u0434\u043d\u0430'
                return f"{num_sp} {denom_nom_fem[b]}"
            return f"\u043e\u0434\u043d\u0430 {n2w(b, lang='ru', to='ordinal')}"
        a_sp = n2w(a, lang='ru')
        if a % 10 == 2 and a % 100 != 12:
            a_sp = '\u0434\u0432\u0435' if a == 2 else a_sp
        denom = denom_gen_pl.get(b, None)
        if denom is None:
            denom = n2w(b, lang='ru', to='ordinal')
            if denom.endswith('\u044b\u0439') or denom.endswith('\u0438\u0439') or denom.endswith('\u043e\u0439'):
                denom = denom[:-2] + '\u044b\u0445'
        return f"{a_sp} {denom}"
    except Exception:
        return None

def to_percent_ru(token: str) -> str | None:
    if not percent_re.match(token):
        return None
    try:
        t = token.replace('\u00A0',' ').strip()
        t = t[:-1].strip()
        neg = t.startswith('-')
        t = t.lstrip('+')
        if t.startswith('-'):
            t = t[1:]
        if ',' in t:
            ip, fp = t.split(',', 1)
        elif '.' in t:
            ip, fp = t.split('.', 1)
        else:
            ip, fp = t, ''
        ip_sp = n2w(int(ip) if ip else 0, lang='ru')
        if fp:
            frac_digits = ' '.join(digit_words[int(ch)] for ch in fp if ch.isdigit())
            core = f"{ip_sp} \u0437\u0430\u043f\u044f\u0442\u0430\u044f {frac_digits}"
        else:
            core = ip_sp
        try:
            ip_val = abs(int(ip)) if ip else 0
        except Exception:
            ip_val = 0
        unit = plural_form(ip_val, '\u043f\u0440\u043e\u0446\u0435\u043d\u0442', '\u043f\u0440\u043e\u0446\u0435\u043d\u0442\u0430', '\u043f\u0440\u043e\u0446\u0435\u043d\u0442\u043e\u0432')
        out = f"{core} {unit}"
        if neg:
            out = '\u043c\u0438\u043d\u0443\u0441 ' + out
        return out
    except Exception:
        return None

def build_normalized_memorizer(train_path: str, chunksize: int = 500_000):
    counts = defaultdict(Counter)
    rows = 0; chunk_idx = 0; t0 = time.time()
    for chunk in pd.read_csv(train_path, compression='zip', usecols=['before','after'], chunksize=chunksize):
        chunk_idx += 1
        bnorm = chunk['before'].map(normalize_before).values
        after = chunk['after'].values
        for b, a in zip(bnorm, after):
            counts[b][a] += 1
        rows += len(chunk)
        if chunk_idx % 5 == 0:
            print(f'  memorizer processed {rows:,} rows (chunks {chunk_idx}), unique keys: {len(counts):,}')
    mem = {b: ctr.most_common(1)[0][0] for b, ctr in counts.items()}
    print(f'  memorizer size: {len(mem):,}. build {time.time()-t0:.1f}s')
    return mem

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

print('Building normalized memorizer (for first backoff) ...')
memorized_norm = build_normalized_memorizer(train_path)

print('Loading test ...')
test = pd.read_csv(test_path, compression='zip')
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
bn = test['before'].map(normalize_before)

pred = []
t0 = time.time()
for i, (b_raw, b) in enumerate(zip(test['before'].values, bn.values)):
    out = None
    if isinstance(b, str):
        # 0) Memorizer first
        out = memorized_norm.get(b, None)
        if out == '<self>':
            out = None
        # 1) ELECTRONIC early gate
        if out is None and url_email_re.search(b):
            tmp = memorized_norm.get(b, None)
            out = tmp if (tmp is not None and tmp != '<self>') else b_raw
        # 2) TELEPHONE
        if out is None and (tel_re_generic.match(b) or tel_re_ru_1.match(b) or tel_re_ru_2.match(b)):
            out = speak_digits_ru(b)
        # 3) DATE
        if out is None:
            out = to_date_ru(b)
        # 4) ORDINAL
        if out is None:
            out = to_ordinal_ru(b)
        # 5) TIME
        if out is None:
            out = to_time_ru(b)
        # 6) MONEY
        if out is None:
            out = to_money_ru(b)
        # 7) FRACTION
        if out is None:
            out = to_fraction_ru(b)
        # 8) PERCENT
        if out is None and percent_re.match(b):
            out = to_percent_ru(b)
        # 9) DECIMAL with guards
        if out is None and dec_re.match(b) and not ip_or_version_re.match(b) and not has_letter_re.search(b) and not url_email_re.search(b) and not date_ddmmyyyy_re.match(b):
            out = to_decimal_ru(b)
        # 10) MEASURE
        if out is None:
            out = to_measure_ru(b)
        # 11) CARDINAL
        if out is None and int_re.match(b) and not url_email_re.search(b):
            out = to_int_cardinal_ru(b)
        # 12) Fallback memorizer -> identity
        if out is None:
            tmp = memorized_norm.get(b, None)
            if tmp is not None and tmp != '<self>':
                out = tmp
    if out is None:
        out = b_raw
    pred.append(out)
    if (i+1) % 200000 == 0:
        print(f'  processed {i+1:,}/{len(test)} tokens...')

submission = pd.DataFrame({'id': test['id'], 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Elapsed:', f'{time.time()-t0:.1f}s')
print(submission.head())

Building normalized memorizer (for first backoff) ...


  memorizer processed 5,000,000 rows (chunks 10), unique keys: 503,997


  memorizer processed 7,500,000 rows (chunks 15), unique keys: 648,998


  memorizer processed 9,515,325 rows (chunks 20), unique keys: 751,509


  memorizer size: 751,509. build 22.3s
Loading test ...


  processed 200,000/1059191 tokens...


  processed 400,000/1059191 tokens...


  processed 600,000/1059191 tokens...


  processed 800,000/1059191 tokens...


  processed 1,000,000/1059191 tokens...


Saved submission.csv with 1059191 rows. Elapsed: 2.6s
    id        after
0  0_0       Теперь
1  0_1          все
2  0_2  уважительно
3  0_3        зовут
4  0_4      Ямамото


In [17]:
# v5: Enhanced DATE rules per expert advice (bare year, dd.mm, ranges, centuries, Roman months)
import re
from num2words import num2words as n2w

# Reuse months/day maps from earlier cells if present; redefine to be safe
months_gen = ['января','февраля','марта','апреля','мая','июня','июля','августа','сентября','октября','ноября','декабря']
day_ord_gen = {
 1:'первого',2:'второго',3:'третьего',4:'четвертого',5:'пятого',6:'шестого',7:'седьмого',8:'восьмого',9:'девятого',10:'десятого',
 11:'одиннадцатого',12:'двенадцатого',13:'тринадцатого',14:'четырнадцатого',15:'пятнадцатого',16:'шестнадцатого',17:'семнадцатого',18:'восемнадцатого',19:'девятнадцатого',
 20:'двадцатого',21:'двадцать первого',22:'двадцать второго',23:'двадцать третьего',24:'двадцать четвертого',25:'двадцать пятого',26:'двадцать шестого',27:'двадцать седьмого',28:'двадцать восьмого',29:'двадцать девятого',
 30:'тридцатого',31:'тридцать первого'
}

def masc_nom_to_gen(word: str) -> str:
    if word.endswith('ый') or word.endswith('ий') or word.endswith('ой'):
        return word[:-2] + 'ого'
    return word
def ordinal_phrase_nom_to_gen(phrase: str) -> str:
    parts = phrase.split()
    if not parts:
        return phrase
    parts[-1] = masc_nom_to_gen(parts[-1])
    return ' '.join(parts)

def year_to_ord_gen(y: int) -> str | None:
    try:
        if 1000 <= y <= 1999:
            rest = y - 1000
            if rest == 0:
                return None
            rest_ord = n2w(rest, lang='ru', to='ordinal')
            rest_ord_gen = ordinal_phrase_nom_to_gen(rest_ord)
            return f'тысяча {rest_ord_gen}'
        if y == 2000:
            return 'двухтысячного'
        if 2001 <= y <= 2099:
            rest = y - 2000
            rest_ord = n2w(rest, lang='ru', to='ordinal')
            rest_ord_gen = ordinal_phrase_nom_to_gen(rest_ord)
            return f'две тысячи {rest_ord_gen}'
        return None
    except Exception:
        return None

# Regexes
date_ddmmyyyy_re = re.compile(r'^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])\.(\d{4})$')
date_d_month_yyyy_re = re.compile(r'^([1-9]|[12]\d|3[01])\s+(января|февраля|марта|апреля|мая|июня|июля|августа|сентября|октября|ноября|декабря)\s+(\d{4})$', re.I)
date_ddmm_re = re.compile(r'^(0[1-9]|[12]\d|3[01])\.(0[1-9]|1[0-2])$')
year_word_re = re.compile(r'^(\d{4})\s*(г\.|год)$', re.I)
bare_year_re = re.compile(r'^(19\d{2}|20\d{2})$')
range_years_re = re.compile(r'^(\d{4})\s*[–—-]\s*(\d{4})\s*(гг?\.|года)?$', re.I)
roman_century_re = re.compile(r'^(X|XX|XXX|I|II|III|IV|V|VI|VII|VIII|IX)\s*(век|в\.)$', re.I)
date_dd_Roman_yyyy_re = re.compile(r'^([1-9]|[12]\d|3[01])\.(I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII)\.(\d{4})$', re.I)

roman_to_int = {
    'I':1,'II':2,'III':3,'IV':4,'V':5,'VI':6,'VII':7,'VIII':8,'IX':9,'X':10,'XI':11,'XII':12
}
def to_century_ordinal_ru(n: int) -> str:
    try:
        return n2w(n, lang='ru', to='ordinal')
    except Exception:
        return ''

def to_date_ru(s: str) -> str | None:
    # dd.mm.yyyy
    m = date_ddmmyyyy_re.match(s)
    if m:
        d = int(m.group(1)); mo = int(m.group(2)); y = int(m.group(3))
        d_sp = day_ord_gen.get(d); y_sp = year_to_ord_gen(y)
        if d_sp and y_sp:
            return f"{d_sp} {months_gen[mo-1]} {y_sp} года"
        return None
    # d month yyyy
    m = date_d_month_yyyy_re.match(s)
    if m:
        d = int(m.group(1)); month_word = m.group(2).lower(); y = int(m.group(3))
        d_sp = day_ord_gen.get(d); y_sp = year_to_ord_gen(y)
        if d_sp and y_sp and month_word in months_gen:
            return f"{d_sp} {month_word} {y_sp} года"
        return None
    # dd.mm (no year)
    m = date_ddmm_re.match(s)
    if m:
        d = int(m.group(1)); mo = int(m.group(2))
        d_sp = day_ord_gen.get(d)
        if d_sp:
            return f"{d_sp} {months_gen[mo-1]}"
        return None
    # yyyy + word (год/г.) -> nominative 'год'
    m = year_word_re.match(s)
    if m:
        y = int(m.group(1))
        y_sp = year_to_ord_gen(y)
        if y_sp:
            return f"{y_sp} год"
        return None
    # bare year -> genitive 'года'
    m = bare_year_re.match(s)
    if m:
        y = int(m.group(1))
        y_sp = year_to_ord_gen(y)
        if y_sp:
            return f"{y_sp} года"
        return None
    # ranges yyyy—yyyy
    m = range_years_re.match(s)
    if m:
        y1 = int(m.group(1)); y2 = int(m.group(2))
        y1_sp = year_to_ord_gen(y1); y2_sp = year_to_ord_gen(y2)
        if y1_sp and y2_sp:
            return f"{y1_sp} — {y2_sp} годы"
        return None
    # Roman century
    m = roman_century_re.match(s)
    if m:
        rc = m.group(1).upper()
        # Map Roman to Arabic century number
        roman_cent_map = {'I':1,'II':2,'III':3,'IV':4,'V':5,'VI':6,'VII':7,'VIII':8,'IX':9,'X':10,'XX':20,'XXX':30}
        n = roman_cent_map.get(rc, None)
        if n is None:
            # Compose if like XIX, XXI etc. Basic support: try to parse by replacement
            try:
                # crude: convert using known numerals
                vals = {'I':1,'V':5,'X':10}
                total = 0
                prev = 0
                for ch in rc[::-1]:
                    v = vals[ch]
                    if v < prev: total -= v
                    else: total += v; prev = v
                n = total
            except Exception:
                n = None
        if n:
            ord_cent = to_century_ordinal_ru(n)
            if ord_cent:
                return f"{ord_cent} век"
        return None
    # dd.Roman.yyyy
    m = date_dd_Roman_yyyy_re.match(s)
    if m:
        d = int(m.group(1)); rm = m.group(2).upper(); y = int(m.group(3))
        mo = roman_to_int.get(rm, None)
        if mo:
            d_sp = day_ord_gen.get(d); y_sp = year_to_ord_gen(y)
            if d_sp and y_sp:
                return f"{d_sp} {months_gen[mo-1]} {y_sp} года"
    return None

In [22]:
# v5 router OOF: add context-aware memorizer (tri/bi/unigram), enhanced MONEY/DECIMAL/MEASURE, routing per expert
import re, time, pandas as pd, numpy as np
from sklearn.model_selection import GroupKFold
from collections import defaultdict, Counter
from num2words import num2words as n2w

# Reuse normalize_before, to_date_ru, to_time_ru, to_ordinal_ru, to_fraction_ru, to_percent_ru, to_int_cardinal_ru if defined
try:
    normalize_before
except NameError:
    import unicodedata
    def normalize_before(x: str) -> str:
        if not isinstance(x, str):
            return x
        x = unicodedata.normalize('NFKC', x)
        x = x.replace('\u00A0', ' ').replace('\u2009', ' ')
        return x

digit_words = ['\u043d\u043e\u043b\u044c','\u043e\u0434\u0438\u043d','\u0434\u0432\u0430','\u0442\u0440\u0438','\u0447\u0435\u0442\u044b\u0440\u0435','\u043f\u044f\u0442\u044c','\u0448\u0435\u0441\u0442\u044c','\u0441\u0435\u043c\u044c','\u0432\u043e\u0441\u0435\u043c\u044c','\u0434\u0435\u0432\u044f\u0442\u044c']

# Gates/regex
has_letter_re = re.compile(r'[A-Za-z\u0410-\u042f\u0430-\u044f]')
url_email_re = re.compile(r'^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$|^https?://|^www\.', re.I)
ip_or_version_re = re.compile(r'^\d+(?:\.\d+){2,}$')
int_re = re.compile(r'^[-+]?\d{1,3}(?:[ \u00A0]?\d{3})*$|^[-+]?\d+$')
date_ddmmyyyy_any_re = re.compile(r'^(0?[1-9]|[12]\d|3[01])\.(0?[1-9]|1[0-2])\.(\d{2,4})$')
time_like_re = re.compile(r'^\d{1,2}[:.]\d{2}$')
sci_re = re.compile(r'^[\d.,]*e[-+]?\d+$', re.I)

# DECIMAL deny/accept
decimal_accept_1 = re.compile(r'^[-+]?(?:\d+|\d{1,3}(?:[ \u00A0]?\d{3})*)[.,]\d+$')
decimal_accept_2 = re.compile(r'^[-+]?\.\d+$')

# TELEPHONE (reuse from v4)
tel_re_generic = re.compile(r'^\+?\d[\d\-\(\) \u00A0]{6,}\d$')
tel_re_ru_1 = re.compile(r'^\+7\d{10}$')
tel_re_ru_2 = re.compile(r'^8\d{10}$')
def speak_digits_ru(s: str) -> str:
    out = []; lead_plus = s.strip().startswith('+')
    for ch in s:
        if ch.isdigit(): out.append(digit_words[int(ch)])
    return ('\u043f\u043b\u044e\u0441 ' if lead_plus else '') + ' '.join(out)

# MONEY regex gates (ordered, case-insensitive)
money_res = [
    re.compile(r'^(-?\d+(?:[ \u00A0]?\d{3})*(?:[.,]\d{1,2})?)\s*(\u0442\u044b\u0441\.?|\u043c\u043b\u043d\.?|\u043c\u043b\u0440\u0434\.?)\s*(\u20bd|\u20BD|\u0440\u0443\u0431(?:\.|\u043b\u0435\u0439|\u043b\u044f|\u043b\u044c)?|\u0440\.?|RUB)$', re.I),
    re.compile(r'^(-?\d+(?:[ \u00A0]?\d{3})*(?:[.,]\d{1,2})?)\s*(\u20bd|\u20BD|\u0440\u0443\u0431(?:\.|\u043b\u0435\u0439|\u043b\u044f|\u043b\u044c)?|\u0440\.?|RUB|\$|USD|US\$|\u20ac|EUR)$', re.I),
    re.compile(r'^(\u20bd|\u20BD|\u0440\u0443\u0431\.?|\u0440\.?|RUB|\$|USD|US\$|\u20ac|EUR)\s*(-?\d+(?:[ \u00A0]?\d{3})*(?:[.,]\d{1,2})?)$', re.I),
    re.compile(r'^-?\d+(?:[ \u00A0]?\d{3})*\s*(\u20bd|\u20BD|\u0440\u0443\u0431\.?|\u0440\.?|RUB)$', re.I),
    re.compile(r'^-?\d+(?:[ \u00A0]?\d{3})*(?:[.,]\d{1,2})?\s*(\u0442\u044b\u0441\.?|\u043c\u043b\u043d\.?|\u043c\u043b\u0440\u0434\.?)\s*(\u20bd|\u20BD|\u0440\u0443\u0431\.?|\u0440\.?|RUB)$', re.I),
    re.compile(r'^(?:\$|USD|US\$|\u20ac|EUR)\s*-?\d+(?:[ \u00A0]?\d{3})*(?:[.,]\d{1,2})?$', re.I),
]

def plural_form(n, one, two_four, many):
    try: n = abs(int(n))
    except Exception: return many; 
    if 11 <= n % 100 <= 14: return many
    if n % 10 == 1: return one
    if n % 10 in (2,3,4): return two_four
    return many

def parse_number_parts(t: str):
    t = t.replace('\u00A0',' ').replace(' ','').lstrip('+')
    neg = t.startswith('-');
    if neg: t = t[1:]
    if ',' in t: ip, fp = t.split(',', 1)
    elif '.' in t: ip, fp = t.split('.', 1)
    else: ip, fp = t, ''
    return neg, ip, fp

def to_money_ru(token: str) -> str | None:
    s = token.strip()
    for rx in money_res:
        m = rx.match(s)
        if not m: continue
        try:
            groups = m.groups()
            # Determine where number and currency/scale are
            if rx is money_res[0]:
                num_s, scale, cur = groups[0], groups[1].lower().rstrip('.'), groups[2].upper()
            elif rx is money_res[1]:
                num_s, scale, cur = groups[0], None, groups[1].upper()
            elif rx is money_res[2]:
                cur, num_s = groups[0].upper(), groups[1]
                scale = None
            elif rx is money_res[3]:
                num_s, cur, scale = s[:m.start(1)].strip(), groups[0].upper(), None
            elif rx is money_res[4]:
                num_s, scale, cur = s[:m.start(2)].strip(), groups[0].lower().rstrip('.'), groups[1].upper()
            else: # last
                cur = s[:m.start(0)+0].split()[0].upper() if s else ''
                num_s = s[s.find(cur)+len(cur):].strip()
                scale = None

            neg, ip, fp = parse_number_parts(num_s)
            ip_val = int(ip) if ip else 0
            # Minor part (2 digits max)
            minor = 0
            if fp:
                minor = int((fp + '0')[:2])

            def rub_forms(n): return ('\u0440\u0443\u0431\u043b\u044c','\u0440\u0443\u0431\u043b\u044f','\u0440\u0443\u0431\u043b\u0435\u0439')[0 if (n%10==1 and n%100!=11) else 1 if (n%10 in (2,3,4) and not 12<=n%100<=14) else 2]
            def kop_forms(n): return ('\u043a\u043e\u043f\u0435\u0439\u043a\u0430','\u043a\u043e\u043f\u0435\u0439\u043a\u0438','\u043a\u043e\u043f\u0435\u0435\u043a')[0 if (n%10==1 and n%100!=11) else 1 if (n%10 in (2,3,4) and not 12<=n%100<=14) else 2]
            def usd_forms(n): return ('\u0434\u043e\u043b\u043b\u0430\u0440','\u0434\u043e\u043b\u043b\u0430\u0440\u0430','\u0434\u043e\u043b\u043b\u0430\u0440\u043e\u0432')[0 if (n%10==1 and n%100!=11) else 1 if (n%10 in (2,3,4) and not 12<=n%100<=14) else 2]
            def cent_forms(n): return ('\u0446\u0435\u043d\u0442','\u0446\u0435\u043d\u0442\u0430','\u0446\u0435\u043d\u0442\u043e\u0432')[0 if (n%10==1 and n%100!=11) else 1 if (n%10 in (2,3,4) and not 12<=n%100<=14) else 2]

            cur_is_rub = cur in {'\u20bd','\u20BD','\u0420\u0423\u0411.','\u0420\u0423\u0411','\u0420.','RUB','\u0420\u0423\u0411\u041b\u042c','\u0420\u0423\u0411\u041b\u0415\u0419','\u0420\u0423\u0411\u041b\u042f'}
            cur_is_usd = cur in {'$','USD','US$'}
            cur_is_eur = cur in {'\u20ac','EUR'}

            # Scales
            scale_word = None
            if scale:
                if scale.startswith('\u0442\u044b\u0441'): scale_word = '\u0442\u044b\u0441\u044f\u0447'
                elif scale.startswith('\u043c\u043b\u043d'): scale_word = '\u043c\u0438\u043b\u043b\u0438\u043e\u043d\u043e\u0432'
                elif scale.startswith('\u043c\u043b\u0440\u0434'): scale_word = '\u043c\u0438\u043b\u043b\u0438\u0430\u0440\u0434\u043e\u0432'

            # Handle scales
            if scale_word:
                # Only safe fractional scale: 1.5
                if fp and not (ip_val == 1 and (fp[:1] == '5')):
                    return None
                if ip_val == 1 and fp and fp[:1] == '5':
                    # 1.5 scale
                    if scale.startswith('\u0442\u044b\u0441'): head = '\u043f\u043e\u043b\u0442\u043e\u0440\u044b \u0442\u044b\u0441\u044f\u0447\u0438'
                    else: head = '\u043f\u043e\u043b\u0442\u043e\u0440\u0430 ' + ('\u043c\u0438\u043b\u043b\u0438\u043e\u043d\u0430' if scale.startswith('\u043c\u043b\u043d') else '\u043c\u0438\u043b\u043b\u0438\u0430\u0440\u0434\u0430')
                else:
                    head = f"{n2w(ip_val, lang='ru')} {scale_word}"
                # Append currency plural by ip
                if cur_is_rub:
                    unit = plural_form(ip_val, '\u0440\u0443\u0431\u043b\u044c','\u0440\u0443\u0431\u043b\u044f','\u0440\u0443\u0431\u043b\u0435\u0439')
                    out = f"{head} {unit}"
                elif cur_is_usd:
                    unit = usd_forms(ip_val)
                    out = f"{head} {unit}"
                elif cur_is_eur:
                    out = f"{head} \u0435\u0432\u0440\u043e"
                else:
                    return None
                if neg: out = '\u043c\u0438\u043d\u0443\u0441 ' + out
                return out

            # No scale
            major_sp = n2w(ip_val, lang='ru') if ip_val != 0 else '\u043d\u043e\u043b\u044c'
            if cur_is_rub:
                unit = plural_form(ip_val, '\u0440\u0443\u0431\u043b\u044c','\u0440\u0443\u0431\u043b\u044f','\u0440\u0443\u0431\u043b\u0435\u0439')
                out = f"{major_sp} {unit}"
                if minor > 0:
                    k_unit = plural_form(minor, '\u043a\u043e\u043f\u0435\u0439\u043a\u0430','\u043a\u043e\u043f\u0435\u0439\u043a\u0438','\u043a\u043e\u043f\u0435\u0435\u043a')
                    out += f" {n2w(minor, lang='ru')} {k_unit}"
            elif cur_is_usd:
                unit = usd_forms(ip_val)
                out = f"{major_sp} {unit}"
                if minor > 0:
                    out += f" {n2w(minor, lang='ru')} {cent_forms(minor)}"
            elif cur_is_eur:
                out = f"{major_sp} \u0435\u0432\u0440\u043e"
                if minor > 0:
                    out += f" {n2w(minor, lang='ru')} {cent_forms(minor)}"
            else:
                return None
            if neg: out = '\u043c\u0438\u043d\u0443\u0441 ' + out
            return out
        except Exception:
            return None
    return None

# DECIMAL verbalizer with guards
def to_decimal_ru(token: str) -> str | None:
    s = token.strip()
    # deny list
    if date_ddmmyyyy_any_re.match(s): return None
    if ip_or_version_re.match(s): return None
    if has_letter_re.search(s): return None
    if url_email_re.search(s): return None
    if sci_re.match(s): return None
    if time_like_re.match(s): return None
    # accept list
    if not (decimal_accept_1.match(s) or decimal_accept_2.match(s)):
        return None
    try:
        neg = s.startswith('-') or s.startswith('+') and s[0] == '-'
        t = s.lstrip('+-')
        if t.startswith('.'): ip, fp = '0', t[1:]
        else:
            if ',' in t: ip, fp = t.split(',', 1)
            else: ip, fp = t.split('.', 1)
        ip_sp = n2w(int(ip) if ip else 0, lang='ru')
        frac_digits = ' '.join(digit_words[int(ch)] for ch in fp if ch.isdigit())
        if not frac_digits: return None
        out = f"{ip_sp} \u0437\u0430\u043f\u044f\u0442\u0430\u044f {frac_digits}"
        if s.strip().startswith('-'): out = '\u043c\u0438\u043d\u0443\u0441 ' + out
        return out
    except Exception:
        return None

# MEASURE regex and unit forms
measure_re = re.compile(r'^-?\d{1,3}(?:[ \u00A0]?\d{3})*(?:[.,]\d+)?\s*(\u043c\u043c|\u0441\u043c|\u0434\u043c|\u043c|\u043a\u043c|\u043c\u00b2|\u043c2|\u043a\u043c\u00b2|\u043c\u00b3|\u043c3|\u0433|\u043a\u0433|\u0442|\u0446|\u043c\u043b|\u043b|\u043a\u043c/\u0447|\u043c/\u0441|\u00b0C|\u00b0\u0421|\u00b0F|\u0412|\u0410|\u0412\u0442|\u043a\u0412\u0442|\u041c\u0412\u0442|\u043a\u0412\u0442\u00b7\u0447|\u0411|\u041a\u0411|\u043a\u0411|\u041c\u0411|\u0413\u0411|\u0422\u0411|\u041f\u0430|\u043a\u041f\u0430|\u0431\u0430\u0440|\u0430\u0442\u043c|\u043c\u043c\s\u0440\u0442\.?\s\u0441\u0442\.)$', re.I)
unit_forms = {
    '\u043c\u043c': ('\u043c\u0438\u043b\u043b\u0438\u043c\u0435\u0442\u0440','\u043c\u0438\u043b\u043b\u0438\u043c\u0435\u0442\u0440\u0430','\u043c\u0438\u043b\u043b\u0438\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u0441\u043c': ('\u0441\u0430\u043d\u0442\u0438\u043c\u0435\u0442\u0440','\u0441\u0430\u043d\u0442\u0438\u043c\u0435\u0442\u0440\u0430','\u0441\u0430\u043d\u0442\u0438\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u0434\u043c': ('\u0434\u0435\u0446\u0438\u043c\u0435\u0442\u0440','\u0434\u0435\u0446\u0438\u043c\u0435\u0442\u0440\u0430','\u0434\u0435\u0446\u0438\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043c': ('\u043c\u0435\u0442\u0440','\u043c\u0435\u0442\u0440\u0430','\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043a\u043c': ('\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440','\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u0430','\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043c\u00b2': ('\u043a\u0432\u0430\u0434\u0440\u0430\u0442\u043d\u044b\u0439 \u043c\u0435\u0442\u0440','\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0445 \u043c\u0435\u0442\u0440\u0430','\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0445 \u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043c2': ('\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0439 \u043c\u0435\u0442\u0440','\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0445 \u043c\u0435\u0442\u0440\u0430','\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0445 \u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043a\u043c\u00b2': ('\u043a\u0432\u0430\u0434\u0440\u0430\u0442\u043d\u044b\u0439 \u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440','\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0445 \u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u0430','\u043a\u0432\u0430\u0434\u0440\u0430\u043d\u0442\u043d\u044b\u0445 \u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043c\u00b3': ('\u043a\u0443\u0431\u0438\u0447\u0435\u0441\u043a\u0438\u0439 \u043c\u0435\u0442\u0440','\u043a\u0443\u0431\u0438\u0447\u0435\u0441\u043a\u0438\u0445 \u043c\u0435\u0442\u0440\u0430','\u043a\u0443\u0431\u0438\u0447\u0435\u0441\u043a\u0438\u0445 \u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u043c3': ('\u043a\u0443\u0431\u0438\u0447\u0435\u0441\u043a\u0438\u0439 \u043c\u0435\u0442\u0440','\u043a\u0443\u0431\u0438\u0447\u0435\u0441\u043a\u0438\u0445 \u043c\u0435\u0442\u0440\u0430','\u043a\u0443\u0431\u0438\u0447\u0435\u0441\u043a\u0438\u0445 \u043c\u0435\u0442\u0440\u043e\u0432'),
    '\u0433': ('\u0433\u0440\u0430\u043c\u043c','\u0433\u0440\u0430\u043c\u043c\u0430','\u0433\u0440\u0430\u043c\u043c\u043e\u0432'),
    '\u043a\u0433': ('\u043a\u0438\u043b\u043e\u0433\u0440\u0430\u043c\u043c','\u043a\u0438\u043b\u043e\u0433\u0440\u0430\u043c\u043c\u0430','\u043a\u0438\u043b\u043e\u0433\u0440\u0430\u043c\u043c\u043e\u0432'),
    '\u0442': ('\u0442\u043e\u043d\u043d\u0430','\u0442\u043e\u043d\u043d\u044b','\u0442\u043e\u043d\u043d'),
    '\u0446': ('\u0446\u0435\u043d\u0442\u043d\u0435\u0440','\u0446\u0435\u043d\u0442\u043d\u0435\u0440\u0430','\u0446\u0435\u043d\u0442\u043d\u0435\u0440\u043e\u0432'),
    '\u043c\u043b': ('\u043c\u0438\u043b\u043b\u0438\u043b\u0438\u0442\u0440','\u043c\u0438\u043b\u043b\u0438\u043b\u0438\u0442\u0440\u0430','\u043c\u0438\u043b\u043b\u0438\u043b\u0438\u0442\u0440\u043e\u0432'),
    '\u043b': ('\u043b\u0438\u0442\u0440','\u043b\u0438\u0442\u0440\u0430','\u043b\u0438\u0442\u0440\u043e\u0432'),
    '\u043a\u043c/\u0447': ('\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440 \u0432 \u0447\u0430\u0441','\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u0430 \u0432 \u0447\u0430\u0441','\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440\u043e\u0432 \u0432 \u0447\u0430\u0441'),
    '\u043c/\u0441': ('\u043c\u0435\u0442\u0440 \u0432 \u0441\u0435\u043a\u0443\u043d\u0434\u0443','\u043c\u0435\u0442\u0440\u0430 \u0432 \u0441\u0435\u043a\u0443\u043d\u0434\u0443','\u043c\u0435\u0442\u0440\u043e\u0432 \u0432 \u0441\u0435\u043a\u0443\u043d\u0434\u0443'),
    '\u00b0C': ('\u0433\u0440\u0430\u0434\u0443\u0441 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u0430 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u043e\u0432 \u0426\u0435\u043b\u044c\u0441\u0438\u044f'),
    '\u00b0\u0421': ('\u0433\u0440\u0430\u0434\u0443\u0441 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u0430 \u0426\u0435\u043b\u044c\u0441\u0438\u044f','\u0433\u0440\u0430\u0434\u0443\u0441\u043e\u0432 \u0426\u0435\u043b\u044c\u0441\u0438\u044f'),
    '\u00b0F': ('\u0433\u0440\u0430\u0434\u0443\u0441 \u0424\u0430\u0440\u0435\u043d\u0433\u0435\u0439\u0442\u0430','\u0433\u0440\u0430\u0434\u0443\u0441\u0430 \u0424\u0430\u0440\u0435\u043d\u0433\u0435\u0439\u0442\u0430','\u0433\u0440\u0430\u0434\u0443\u0441\u043e\u0432 \u0424\u0430\u0440\u0435\u043d\u0433\u0435\u0439\u0442\u0430'),
    '\u0412': ('\u0432\u043e\u043b\u044c\u0442','\u0432\u043e\u043b\u044c\u0442\u0430','\u0432\u043e\u043b\u044c\u0442'),
    '\u0410': ('\u0430\u043c\u043f\u0435\u0440','\u0430\u043c\u043f\u0435\u0440\u0430','\u0430\u043c\u043f\u0435\u0440'),
    '\u0412\u0442': ('\u0432\u0430\u0442\u0442','\u0432\u0430\u0442\u0442\u0430','\u0432\u0430\u0442\u0442'),
    '\u043a\u0412\u0442': ('\u043a\u0438\u043b\u043e\u0432\u0430\u0442\u0442','\u043a\u0438\u043b\u043e\u0432\u0430\u0442\u0442\u0430','\u043a\u0438\u043b\u043e\u0432\u0430\u0442\u0442'),
    '\u041c\u0412\u0442': ('\u043c\u0435\u0433\u0430\u0432\u0430\u0442\u0442','\u043c\u0435\u0433\u0430\u0432\u0430\u0442\u0442\u0430','\u043c\u0435\u0433\u0430\u0432\u0430\u0442\u0442'),
    '\u043a\u0412\u0442\u00b7\u0447': ('\u043a\u0438\u043b\u043e\u0432\u0430\u0442\u0442-\u0447\u0430\u0441','\u043a\u0438\u043b\u043e\u0432\u0430\u0442\u0442-\u0447\u0430\u0441\u0430','\u043a\u0438\u043b\u043e\u0432\u0430\u0442\u0442-\u0447\u0430\u0441\u043e\u0432'),
    '\u0411': ('\u0431\u0430\u0439\u0442','\u0431\u0430\u0439\u0442\u0430','\u0431\u0430\u0439\u0442'),
    '\u041a\u0411': ('\u043a\u0438\u043b\u043e\u0431\u0430\u0439\u0442','\u043a\u0438\u043b\u043e\u0431\u0430\u0439\u0442\u0430','\u043a\u0438\u043b\u043e\u0431\u0430\u0439\u0442'),
    '\u043a\u0411': ('\u043a\u0438\u043b\u043e\u0431\u0430\u0439\u0442','\u043a\u0438\u043b\u043e\u0431\u0430\u0439\u0442\u0430','\u043a\u0438\u043b\u043e\u0431\u0430\u0439\u0442'),
    '\u041c\u0411': ('\u043c\u0435\u0433\u0430\u0431\u0430\u0439\u0442','\u043c\u0435\u0433\u0430\u0431\u0430\u0439\u0442\u0430','\u043c\u0435\u0433\u0430\u0431\u0430\u0439\u0442'),
    '\u0413\u0411': ('\u0433\u0438\u0433\u0430\u0431\u0430\u0439\u0442','\u0433\u0438\u0433\u0430\u0431\u0430\u0439\u0442\u0430','\u0433\u0438\u0433\u0430\u0431\u0430\u0439\u0442'),
    '\u0422\u0411': ('\u0442\u0435\u0440\u0430\u0431\u0430\u0439\u0442','\u0442\u0435\u0440\u0430\u0431\u0430\u0439\u0442\u0430','\u0442\u0435\u0440\u0430\u0431\u0430\u0439\u0442'),
    '\u041f\u0430': ('\u043f\u0430\u0441\u043a\u0430\u043b\u044c','\u043f\u0430\u0441\u043a\u0430\u043b\u044f','\u043f\u0430\u0441\u043a\u0430\u043b\u0435\u0439'),
    '\u043a\u041f\u0430': ('\u043a\u0438\u043b\u043e\u043f\u0430\u0441\u043a\u0430\u043b\u044c','\u043a\u0438\u043b\u043e\u043f\u0430\u0441\u043a\u0430\u043b\u044f','\u043a\u0438\u043b\u043e\u043f\u0430\u0441\u043a\u0430\u043b\u0435\u0439'),
    '\u0431\u0430\u0440': ('\u0431\u0430\u0440','\u0431\u0430\u0440\u0430','\u0431\u0430\u0440'),
    '\u0430\u0442\u043c': ('\u0430\u0442\u043c\u043e\u0441\u0444\u0435\u0440\u0430','\u0430\u0442\u043c\u043e\u0441\u0444\u0435\u0440\u044b','\u0430\u0442\u043c\u043e\u0441\u0444\u0435\u0440'),
    '\u043c\u043c \u0440\u0442. \u0441\u0442.': ('\u043c\u0438\u043b\u043b\u0438\u043c\u0435\u0442\u0440 \u0440\u0442\u0443\u0442\u043d\u043e\u0433\u043e \u0441\u0442\u043e\u043b\u0431\u0430','\u043c\u0438\u043b\u043b\u0438\u043c\u0435\u0442\u0440\u0430 \u0440\u0442\u0443\u0442\u043d\u043e\u0433\u043e \u0441\u0442\u043e\u043b\u0431\u0430','\u043c\u0438\u043b\u043b\u0438\u043c\u0435\u0442\u0440\u043e\u0432 \u0440\u0442\u0443\u0442\u043d\u043e\u0433\u043e \u0441\u0442\u043e\u043b\u0431\u0430'),
}

def to_measure_ru(token: str) -> str | None:
    m = measure_re.match(token.strip())
    if not m: return None
    try:
        unit_raw = m.group(1)
        # unify unit key
        unit_key = unit_raw
        # special normalize for '\u043c\u043c \u0440\u0442. \u0441\u0442.'
        if unit_key.lower().startswith('\u043c\u043c'): unit_key = '\u043c\u043c \u0440\u0442. \u0441\u0442.' if '\u0440\u0442' in token else unit_key
        # numeric part
        num_part = token[:m.start(1)].strip()
        neg, ip, fp = parse_number_parts(num_part)
        ip_val = int(ip) if ip else 0
        num_sp = n2w(ip_val, lang='ru')
        forms = unit_forms.get(unit_key, None)
        if forms is None: return None
        unit_word = plural_form(ip_val, forms[0], forms[1], forms[2])
        out = f"{num_sp} {unit_word}"
        if neg: out = '\u043c\u0438\u043d\u0443\u0441 ' + out
        return out
    except Exception:
        return None

# Context-aware memorizer builders
def build_context_maps(df: pd.DataFrame, min_tri=3, min_bi=5):
    # df must have before_norm, after, prev_before_norm, next_before_norm
    # trigram
    tri_counts = df.value_counts(['prev_before_norm','before_norm','next_before_norm','after']).reset_index(name='cnt')
    tri_counts = tri_counts.sort_values(['prev_before_norm','before_norm','next_before_norm','cnt'], ascending=[True,True,True,False])
    tri_counts = tri_counts[tri_counts['cnt']>=min_tri]
    tri_map = tri_counts.drop_duplicates(['prev_before_norm','before_norm','next_before_norm']).set_index(['prev_before_norm','before_norm','next_before_norm'])['after'].to_dict()
    # left bigram
    lb_counts = df.value_counts(['prev_before_norm','before_norm','after']).reset_index(name='cnt')
    lb_counts = lb_counts.sort_values(['prev_before_norm','before_norm','cnt'], ascending=[True,True,False])
    lb_counts = lb_counts[lb_counts['cnt']>=min_bi]
    lb_map = lb_counts.drop_duplicates(['prev_before_norm','before_norm']).set_index(['prev_before_norm','before_norm'])['after'].to_dict()
    # right bigram
    rb_counts = df.value_counts(['before_norm','next_before_norm','after']).reset_index(name='cnt')
    rb_counts = rb_counts.sort_values(['before_norm','next_before_norm','cnt'], ascending=[True,True,False])
    rb_counts = rb_counts[rb_counts['cnt']>=min_bi]
    rb_map = rb_counts.drop_duplicates(['before_norm','next_before_norm']).set_index(['before_norm','next_before_norm'])['after'].to_dict()
    # unigram
    uni_counts = df.value_counts(['before_norm','after']).reset_index(name='cnt')
    uni_counts = uni_counts.sort_values(['before_norm','cnt'], ascending=[True,False])
    uni_map = uni_counts.drop_duplicates(['before_norm']).set_index(['before_norm'])['after'].to_dict()
    return tri_map, lb_map, rb_map, uni_map

print('Loading full train...')
train = pd.read_csv('ru_train.csv.zip', compression='zip', usecols=['sentence_id','token_id','class','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
print('Train shape:', train.shape)
# attach prev/next normalized tokens
train = train.sort_values(['sentence_id','token_id'])
train['prev_before_norm'] = train.groupby('sentence_id')['before_norm'].shift(1).fillna('')
train['next_before_norm'] = train.groupby('sentence_id')['before_norm'].shift(-1).fillna('')

gkf = GroupKFold(n_splits=5)
groups = train['sentence_id'].values
oof_pred = pd.Series(index=train.index, dtype=object)
t0 = time.time()
for fold, (tr_idx, va_idx) in enumerate(gkf.split(train, groups=groups), 1):
    tf = time.time()
    print(f'Fold {fold}: train {len(tr_idx):,}, valid {len(va_idx):,}')
    tr = train.loc[tr_idx, ['before_norm','after','prev_before_norm','next_before_norm']]
    tri_map, lb_map, rb_map, uni_map = build_context_maps(tr, min_tri=3, min_bi=5)
    # validation slice with context keys
    va = train.loc[va_idx, ['before','before_norm','after','prev_before_norm','next_before_norm']]

    preds = []
    for (_, row) in va.iterrows():
        b_raw = row['before']; b = row['before_norm']
        prevb = row['prev_before_norm']; nextb = row['next_before_norm']
        out = None
        if isinstance(b, str):
            # Context-aware memorizer cascade: tri -> left bi -> right bi -> uni
            tmp = tri_map.get((prevb,b,nextb), None)
            if tmp and tmp != '<self>': out = tmp
            if out is None:
                tmp = lb_map.get((prevb,b), None)
                if tmp and tmp != '<self>': out = tmp
            if out is None:
                tmp = rb_map.get((b,nextb), None)
                if tmp and tmp != '<self>': out = tmp
            if out is None:
                tmp = uni_map.get(b, None)
                if tmp and tmp != '<self>': out = tmp
            # ELECTRONIC guard
            if out is None and url_email_re.search(b):
                tmp = uni_map.get(b, None)
                out = tmp if (tmp is not None and tmp != '<self>') else b_raw
            # TELEPHONE
            if out is None and (tel_re_generic.match(b) or tel_re_ru_1.match(b) or tel_re_ru_2.match(b)):
                out = speak_digits_ru(b)
            # DATE
            if out is None:
                try:
                    out = to_date_ru(b)
                except NameError:
                    out = None
            # ORDINAL
            if out is None:
                try:
                    out = to_ordinal_ru(b)
                except NameError:
                    out = None
            # TIME
            if out is None:
                try:
                    out = to_time_ru(b)
                except NameError:
                    out = None
            # MONEY
            if out is None:
                out = to_money_ru(b)
            # FRACTION
            if out is None:
                try:
                    out = to_fraction_ru(b)
                except NameError:
                    out = None
            # PERCENT
            if out is None and '%' in b:
                try:
                    out = to_percent_ru(b)
                except NameError:
                    out = None
            # DECIMAL guarded
            if out is None:
                out = to_decimal_ru(b)
            # MEASURE
            if out is None:
                out = to_measure_ru(b)
            # CARDINAL
            if out is None and int_re.match(b) and not url_email_re.search(b):
                try:
                    out = to_int_cardinal_ru(b)
                except NameError:
                    out = None
        if out is None: out = b_raw
        preds.append(out)
    preds = pd.Series(preds, index=va_idx)
    oof_pred.loc[va_idx] = preds
    acc = (preds.values == train.loc[va_idx, 'after'].values).mean()
    print(f'  fold {fold} acc: {acc:.6f} | elapsed {time.time()-tf:.1f}s')

overall = (oof_pred.values == train['after'].values).mean()
print(f'v5 router OOF overall accuracy: {overall:.6f}')
print('Per-class OOF (top 20):')
cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)
print(cls_acc.head(20))
print('Done. Total elapsed:', f'{time.time()-t0:.1f}s')

Loading full train...


Train shape: (9515325, 6)


Fold 1: train 7,612,260, valid 1,903,065


  fold 1 acc: 0.977857 | elapsed 171.3s
Fold 2: train 7,612,260, valid 1,903,065


  fold 2 acc: 0.977781 | elapsed 169.6s
Fold 3: train 7,612,260, valid 1,903,065


  fold 3 acc: 0.977824 | elapsed 169.4s
Fold 4: train 7,612,260, valid 1,903,065


  fold 4 acc: 0.977869 | elapsed 169.1s
Fold 5: train 7,612,260, valid 1,903,065


  fold 5 acc: 0.977819 | elapsed 167.9s
v5 router OOF overall accuracy: 0.977830
Per-class OOF (top 20):


class
PUNCT         0.999345
VERBATIM      0.989484
PLAIN         0.986002
LETTERS       0.941345
DIGIT         0.900336
TIME          0.872244
CARDINAL      0.869176
ORDINAL       0.849830
DATE          0.798244
FRACTION      0.637986
MEASURE       0.569446
ELECTRONIC    0.418627
DECIMAL       0.361136
MONEY         0.223794
TELEPHONE     0.204724
dtype: float64
Done. Total elapsed: 859.1s


  cls_acc = train.assign(pred=oof_pred).groupby('class').apply(lambda df: (df['pred'].values == df['after'].values).mean()).sort_values(ascending=False)


In [None]:
# Inference v5: context-aware memorizer (tri/bi/uni) + advanced routing, mirrors Cell 15
import re, time, pandas as pd, numpy as np
from collections import defaultdict, Counter
from num2words import num2words as n2w

# Reuse helpers from Cell 15 (assumes already defined in kernel):
# normalize_before, to_date_ru, to_time_ru, to_ordinal_ru, to_fraction_ru, to_percent_ru, to_int_cardinal_ru,
# speak_digits_ru, to_money_ru, to_decimal_ru, to_measure_ru, plural_form,
# regex: url_email_re, tel_re_generic, tel_re_ru_1, tel_re_ru_2, int_re, dec accept/deny inside to_decimal_ru

def build_context_maps_full(train_df: pd.DataFrame, min_tri=3, min_bi=5):
    # train_df must have before_norm, after, prev_before_norm, next_before_norm
    t0 = time.time()
    print('Building trigram map...')
    tri_counts = train_df.value_counts(['prev_before_norm','before_norm','next_before_norm','after']).reset_index(name='cnt')
    tri_counts = tri_counts.sort_values(['prev_before_norm','before_norm','next_before_norm','cnt'], ascending=[True,True,True,False])
    tri_counts = tri_counts[tri_counts['cnt']>=min_tri]
    tri_map = tri_counts.drop_duplicates(['prev_before_norm','before_norm','next_before_norm']).set_index(['prev_before_norm','before_norm','next_before_norm'])['after'].to_dict()
    print(f'  trigram entries: {len(tri_map):,} | elapsed {time.time()-t0:.1f}s'); t1=time.time()
    print('Building left-bigram map...')
    lb_counts = train_df.value_counts(['prev_before_norm','before_norm','after']).reset_index(name='cnt')
    lb_counts = lb_counts.sort_values(['prev_before_norm','before_norm','cnt'], ascending=[True,True,False])
    lb_counts = lb_counts[lb_counts['cnt']>=min_bi]
    lb_map = lb_counts.drop_duplicates(['prev_before_norm','before_norm']).set_index(['prev_before_norm','before_norm'])['after'].to_dict()
    print(f'  left-bigram entries: {len(lb_map):,} | elapsed {time.time()-t1:.1f}s'); t2=time.time()
    print('Building right-bigram map...')
    rb_counts = train_df.value_counts(['before_norm','next_before_norm','after']).reset_index(name='cnt')
    rb_counts = rb_counts.sort_values(['before_norm','next_before_norm','cnt'], ascending=[True,True,False])
    rb_counts = rb_counts[rb_counts['cnt']>=min_bi]
    rb_map = rb_counts.drop_duplicates(['before_norm','next_before_norm']).set_index(['before_norm','next_before_norm'])['after'].to_dict()
    print(f'  right-bigram entries: {len(rb_map):,} | elapsed {time.time()-t2:.1f}s'); t3=time.time()
    print('Building unigram map...')
    uni_counts = train_df.value_counts(['before_norm','after']).reset_index(name='cnt')
    uni_counts = uni_counts.sort_values(['before_norm','cnt'], ascending=[True,False])
    uni_map = uni_counts.drop_duplicates(['before_norm']).set_index(['before_norm'])['after'].to_dict()
    print(f'  unigram entries: {len(uni_map):,} | elapsed {time.time()-t3:.1f}s')
    print('Context maps built. Total elapsed:', f'{time.time()-t0:.1f}s')
    return tri_map, lb_map, rb_map, uni_map

def route_token_v5(b_raw, b_norm, prevb, nextb, tri_map, lb_map, rb_map, uni_map):
    out = None
    if isinstance(b_norm, str):
        # Context-aware memorizer cascade: tri -> left bi -> right bi -> uni (skip literal '<self>')
        tmp = tri_map.get((prevb,b_norm,nextb), None)
        if tmp and tmp != '<self>': out = tmp
        if out is None:
            tmp = lb_map.get((prevb,b_norm), None)
            if tmp and tmp != '<self>': out = tmp
        if out is None:
            tmp = rb_map.get((b_norm,nextb), None)
            if tmp and tmp != '<self>': out = tmp
        if out is None:
            tmp = uni_map.get(b_norm, None)
            if tmp and tmp != '<self>': out = tmp
        # ELECTRONIC guard (early)
        if out is None and url_email_re.search(b_norm):
            tmp = uni_map.get(b_norm, None)
            return tmp if (tmp is not None and tmp != '<self>') else b_raw
        # TELEPHONE
        if out is None and (tel_re_generic.match(b_norm) or tel_re_ru_1.match(b_norm) or tel_re_ru_2.match(b_norm)):
            out = speak_digits_ru(b_norm)
        # DATE
        if out is None:
            out = to_date_ru(b_norm)
        # ORDINAL
        if out is None:
            out = to_ordinal_ru(b_norm)
        # TIME
        if out is None:
            out = to_time_ru(b_norm)
        # MONEY
        if out is None:
            out = to_money_ru(b_norm)
        # FRACTION
        if out is None:
            out = to_fraction_ru(b_norm)
        # PERCENT
        if out is None and '%' in b_norm:
            out = to_percent_ru(b_norm)
        # DECIMAL (guarded within function)
        if out is None:
            out = to_decimal_ru(b_norm)
        # MEASURE
        if out is None:
            out = to_measure_ru(b_norm)
        # CARDINAL
        if out is None and int_re.match(b_norm) and not url_email_re.search(b_norm):
            out = to_int_cardinal_ru(b_norm)
    return out if out is not None else b_raw

train_path = 'ru_train.csv.zip'
test_path = 'ru_test_2.csv.zip'

t0 = time.time()
print('Loading train for context maps...')
train = pd.read_csv(train_path, compression='zip', usecols=['sentence_id','token_id','before','after'])
train['before_norm'] = train['before'].map(normalize_before)
train = train.sort_values(['sentence_id','token_id'])
train['prev_before_norm'] = train.groupby('sentence_id')['before_norm'].shift(1).fillna('')
train['next_before_norm'] = train.groupby('sentence_id')['before_norm'].shift(-1).fillna('')
print('Train ready:', train.shape)

tri_map, lb_map, rb_map, uni_map = build_context_maps_full(train[['prev_before_norm','before_norm','next_before_norm','after']], min_tri=3, min_bi=5)
del train

print('Loading test...')
test = pd.read_csv(test_path, compression='zip')
test = test.sort_values(['sentence_id','token_id'])
test['id'] = test['sentence_id'].astype(str) + '_' + test['token_id'].astype(str)
test['before_norm'] = test['before'].map(normalize_before)
test['prev_before_norm'] = test.groupby('sentence_id')['before_norm'].shift(1).fillna('')
test['next_before_norm'] = test.groupby('sentence_id')['before_norm'].shift(-1).fillna('')
print('Test ready:', test.shape)

pred = []
t1 = time.time()
for i, row in enumerate(test.itertuples(index=False), 1):
    b_raw = row.before
    b = row.before_norm
    prevb = row.prev_before_norm
    nextb = row.next_before_norm
    out = route_token_v5(b_raw, b, prevb, nextb, tri_map, lb_map, rb_map, uni_map)
    pred.append(out)
    if i % 200000 == 0:
        print(f'  processed {i:,}/{len(test)} | elapsed {time.time()-t1:.1f}s', flush=True)

submission = pd.DataFrame({'id': test['id'].values, 'after': pred})
submission.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission), 'rows. Total elapsed:', f'{time.time()-t0:.1f}s')
print(submission.head())

Loading train for context maps...
