# Plan to Medal: Text Normalization (Russian)

## Goals
- Reach at least bronze-medal level by building a strong, fast baseline, then iterating.

## Data Understanding
- Load ru_train.csv.zip, ru_test_2.csv.zip, ru_sample_submission_2.csv.zip.
- Inspect columns, size, and whether data is token-level (common for TN challenges).

## Baseline Approach
1) Per-token majority memorization with backoff:
- For each source token/context key, memorize the most frequent normalized target; back off to lower-order keys, then to the identity mapping.
- Usually a strong starting point for TN tasks, since most tokens tend to have a single dominant normalization.

2) Context-aware features:
- Use (token, semiotic class if available, left/right neighbors, casing, punctuation patterns).
- If class not provided, derive regex-based features (digits, dates, currency, roman numerals, abbreviations, etc.).

3) Model
- Start with frequency lexicon + backoff.
- Add a token-level CatBoost/LightGBM classifier to choose between candidate normalizations when one key maps to more than one target.
- Optionally train separate models per detected semiotic type (e.g., PLAIN, DATE, CARDINAL, etc.) if labels exist.
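
A minimal sketch of the regex-based fallback from step 2, for the case where the class column is missing at inference time. The pattern list, its ordering, the `ROMAN` label, and the name `coarse_class` are all illustrative assumptions, not rules derived from the training data; the real patterns would need tuning against the labeled classes.

```python
import re

# Hypothetical coarse patterns; the first match wins, so order matters.
# These are illustrative guesses, not rules learned from the data.
COARSE_PATTERNS = [
    ("PUNCT", re.compile(r"^[^\w\s]+$")),
    ("DATE", re.compile(r"^\d{1,2}[./]\d{1,2}[./]\d{2,4}$|^\d{4}\s+год\w*$")),
    ("TIME", re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")),
    ("MONEY", re.compile(r"^[$€₽]\s?\d|\d\s?(руб|[$€₽])")),
    ("DECIMAL", re.compile(r"^-?\d+[.,]\d+$")),
    ("CARDINAL", re.compile(r"^-?\d+$")),
    ("ROMAN", re.compile(r"^[IVXLCDM]+$")),  # not a competition class; a feature only
    ("LETTERS", re.compile(r"^(?:[A-ZА-ЯЁ]\.?){2,}$")),
]


def coarse_class(token: str) -> str:
    """Return the name of the first matching pattern, or PLAIN if none match."""
    for name, pattern in COARSE_PATTERNS:
        if pattern.search(token):
            return name
    return "PLAIN"


for tok in ["По", "1862 год", ".", "-101903", "12:30", "XIV"]:
    print(repr(tok), "->", coarse_class(tok))
```

The resulting label could serve either as an extra backoff key or as a categorical feature for the disambiguation model.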

## Evaluation
- Create a local CV split consistent with the dataset structure (group-wise by article or utterance/sentence, if applicable).
- Metric: accuracy over tokens/rows as per competition.
- Log fold times and results.
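
A minimal sketch of a group-wise split and the token-level accuracy, assuming the grouping unit is a sentence id. The helpers `fold_of` and `token_accuracy` and the toy rows are hypothetical; the sentence grouping of the toy rows is made up for illustration. With scikit-learn available, `GroupKFold` would be the more conventional choice.

```python
import hashlib


def fold_of(sentence_id: int, n_folds: int = 5) -> int:
    """Map a sentence id to a fold deterministically, so all of its tokens stay together."""
    digest = hashlib.md5(str(sentence_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def token_accuracy(y_true, y_pred) -> float:
    """Fraction of tokens whose predicted normalization matches the target exactly."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy rows: (sentence_id, token_id, before, after).
toy = [
    (0, 0, "По", "По"),
    (0, 1, "состоянию", "состоянию"),
    (0, 2, "на", "на"),
    (1, 0, "-1", "минус один"),
    (1, 1, "-10", "минус десять"),
]

valid_fold = 0
train_rows = [r for r in toy if fold_of(r[0]) != valid_fold]
valid_rows = [r for r in toy if fold_of(r[0]) == valid_fold]

# Identity baseline over all toy rows: predict `after` = `before`.
identity_acc = token_accuracy([r[3] for r in toy], [r[2] for r in toy])
print(f"train={len(train_rows)} valid={len(valid_rows)} identity_acc={identity_acc:.2f}")
```

Hashing the id rather than shuffling rows keeps the fold assignment stable across runs and never splits a sentence between train and validation.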

## Inference
- Generate predictions for test using the lexicon+backoff; apply model only when ambiguous.
- Save to submission.csv exactly matching sample format.
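
One way to decide when a key counts as ambiguous is to look at the share of its most frequent target. The sketch below assumes a threshold of 0.9, which is an arbitrary illustrative value; `key_stats`, `ambiguous_keys`, and the toy pairs are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict


def key_stats(pairs):
    """For each `before` key, return (most frequent `after`, share of that target)."""
    counts = defaultdict(Counter)
    for before, after in pairs:
        counts[before][after] += 1
    stats = {}
    for before, counter in counts.items():
        top_after, top_n = counter.most_common(1)[0]
        stats[before] = (top_after, top_n / sum(counter.values()))
    return stats


def ambiguous_keys(stats, min_share=0.9):
    """Keys whose dominant target covers less than `min_share` of occurrences."""
    return {key for key, (_, share) in stats.items() if share < min_share}


# Toy (before, after) pairs, invented for illustration.
toy_pairs = [
    ("2", "два"), ("2", "двух"), ("2", "два"),
    ("на", "на"), ("на", "на"),
]

stats = key_stats(toy_pairs)
print(stats["2"])             # dominant target and its share
print(ambiguous_keys(stats))  # keys to route to the model
```

Keys outside the ambiguous set would keep the plain lexicon prediction, so the model only runs on the rows where it can change the outcome.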

## Timeline
- T0: Load/EDA.
- T1: Build frequency/backoff baseline; submit.
- T2: Add regex feature extractor + CatBoost disambiguation; resubmit.
- T3: Error analysis on mismatches; targeted rules (dates, times, money).

## Checkpoints
- Request expert review after plan, after EDA, after first baseline results, and before heavy training.

In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from time import time

t0 = time()
print("Loading data...")
train_path = "ru_train.csv.zip"
test_path = "ru_test_2.csv.zip"
sample_path = "ru_sample_submission_2.csv.zip"

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
sample = pd.read_csv(sample_path)
print(f"Loaded: train={train.shape}, test={test.shape}, sample={sample.shape}")

def info_df(name, df):
    print(f"\n=== {name} columns ===")
    print(list(df.columns))
    print(f"dtypes:\n{df.dtypes}")
    print("head:")
    print(df.head(10))

info_df("train", train)
info_df("test", test)
info_df("sample", sample)

# Expect columns like: sentence_id, (token_id or id), before, after (train only), class/semiotic_class
cols = set(train.columns)
sent_col = 'sentence_id' if 'sentence_id' in cols else ('sentence' if 'sentence' in cols else None)
id_col = 'id' if 'id' in cols else ('token_id' if 'token_id' in cols else None)
before_col = 'before' if 'before' in cols else None
after_col = 'after' if 'after' in cols else None
class_col = 'class' if 'class' in cols else ('semiotic_class' if 'semiotic_class' in cols else None)
print("\nDetected columns:")
print({
    'sent_col': sent_col,
    'id_col': id_col,
    'before_col': before_col,
    'after_col': after_col,
    'class_col': class_col
})

if class_col:
    # Fill missing values before casting; astype(str) first would turn NaN into the string 'nan'
    classes = train[class_col].fillna('NA').astype(str).unique().tolist()
    print(f"\nUnique classes in train ({len(classes)}):", classes[:50])

# Check if test has class
test_has_class = class_col in test.columns if class_col else False
print(f"Test has class column: {test_has_class}")

# NBSP/NNBSP detection in before: count rows containing U+00A0 or U+202F
NBSP_RE = r"[\u00A0\u202F]"
if before_col:
    nb_train = int(train[before_col].astype(str).str.contains(NBSP_RE, regex=True).sum())
    nb_test = int(test[before_col].astype(str).str.contains(NBSP_RE, regex=True).sum()) if before_col in test.columns else np.nan
    print(f"NBSP present in train rows: {nb_train}")
    print(f"NBSP present in test rows: {nb_test}")

# Tokenization identity checks
common_cols = set(train.columns).intersection(set(test.columns))
print(f"Common columns train/test: {sorted(list(common_cols))}")

# Derive prev/next within sentence if available
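if sent_col and id_col and before_col:
    print("\nDeriving prev/next tokens within sentence on a small sample...")
    # Take the sample first, then copy, to avoid copying all rows.
    # Note: the last sentence in the sample may be cut off, so its final <EOS> can be spurious.
    tmp = train[[sent_col, id_col, before_col]].head(5000).copy()
    tmp = tmp.sort_values([sent_col, id_col])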
    tmp['prev_before'] = tmp.groupby(sent_col)[before_col].shift(1).fillna('<BOS>')
    tmp['next_before'] = tmp.groupby(sent_col)[before_col].shift(-1).fillna('<EOS>')
    print(tmp.head(10))
else:
    print("Skipping prev/next derivation (missing sent/id/before)")

# Basic frequency for (class,before) if class exists
if class_col and before_col and after_col:
    print("\nBuilding quick freq table for (class, before) -> top after (preview)...")
    grp = train.groupby([class_col, before_col])[after_col].agg(lambda x: Counter(x).most_common(1)[0][0]).reset_index().head(10)
    print(grp)

print(f"\nDone in {time()-t0:.2f}s")

Loading data...


Loaded: train=(9515325, 5), test=(1059191, 3), sample=(1059191, 2)

=== train columns ===
['sentence_id', 'token_id', 'class', 'before', 'after']
dtypes:
sentence_id     int64
token_id        int64
class          object
before         object
after          object
dtype: object
head:
   sentence_id  token_id  class      before  \
0            0         0  PLAIN          По   
1            0         1  PLAIN   состоянию   
2            0         2  PLAIN          на   
3            0         3   DATE    1862 год   
4            0         4  PUNCT           .   
5            1         0  PLAIN  Оснащались   
6            1         1  PLAIN     латными   
7            1         2  PLAIN  рукавицами   
8            1         3  PLAIN           и   
9            1         4  PLAIN  сабатонами   

                                    after  
0                                      По  
1                               состоянию  
2                                      на  
3  тысяча восемьсот ше


Unique classes in train (15): ['PLAIN', 'DATE', 'PUNCT', 'ORDINAL', 'VERBATIM', 'LETTERS', 'CARDINAL', 'MEASURE', 'TELEPHONE', 'ELECTRONIC', 'DECIMAL', 'DIGIT', 'FRACTION', 'TIME', 'MONEY']
Test has class column: False


NBSP present in train rows: 0
NBSP present in test rows: 0
Common columns train/test: ['before', 'sentence_id', 'token_id']

Deriving prev/next tokens within sentence on a small sample...


   sentence_id  token_id      before prev_before next_before
0            0         0          По       <BOS>   состоянию
1            0         1   состоянию          По          на
2            0         2          на   состоянию    1862 год
3            0         3    1862 год          на           .
4            0         4           .    1862 год       <EOS>
5            1         0  Оснащались       <BOS>     латными
6            1         1     латными  Оснащались  рукавицами
7            1         2  рукавицами     латными           и
8            1         3           и  рукавицами  сабатонами
9            1         4  сабатонами           и           с

Building quick freq table for (class, before) -> top after (preview)...


      class   before                                after
0  CARDINAL       -0                           минус ноль
1  CARDINAL       -1                           минус один
2  CARDINAL      -10                         минус десять
3  CARDINAL     -100                            минус сто
4  CARDINAL     -101                       минус сто один
5  CARDINAL    -1011             минус тысяча одиннадцать
6  CARDINAL  -101903  минус сто одна тысяча девятьсот три
7  CARDINAL     -102                        минус сто два
8  CARDINAL    -1024         минус тысяча двадцать четыре
9  CARDINAL    -1028         минус тысяча двадцать восемь

Done in 39.71s


In [2]:
import pandas as pd
import numpy as np
from time import time

t0 = time()
print("Deriving prev/next for train/test...")
# Minimize columns to reduce memory
tr_cols = ['sentence_id', 'token_id', 'before', 'after']
train_ctx = train[tr_cols].copy()
train_ctx = train_ctx.sort_values(['sentence_id', 'token_id'])
train_ctx['prev_before'] = train_ctx.groupby('sentence_id')['before'].shift(1).fillna('<BOS>')
train_ctx['next_before'] = train_ctx.groupby('sentence_id')['before'].shift(-1).fillna('<EOS>')

te_cols = ['sentence_id', 'token_id', 'before']
test_ctx = test[te_cols].copy()
test_ctx = test_ctx.sort_values(['sentence_id', 'token_id'])
test_ctx['prev_before'] = test_ctx.groupby('sentence_id')['before'].shift(1).fillna('<BOS>')
test_ctx['next_before'] = test_ctx.groupby('sentence_id')['before'].shift(-1).fillna('<EOS>')

print(f"Context derived in {time()-t0:.2f}s")

def build_top_map(df, key_cols, value_col='after', logname=''):
    t = time()
    cols = key_cols + [value_col]
    tmp = df[cols].copy()
    # group by key+value, count, keep top value per key
    cnt = tmp.groupby(cols, observed=True).size().reset_index(name='cnt')
    cnt = cnt.sort_values(key_cols + ['cnt'], ascending=[True]*len(key_cols) + [False])
    top = cnt.drop_duplicates(subset=key_cols, keep='first')
    print(f"Built map {logname or key_cols} with {top.shape[0]} keys in {time()-t:.2f}s")
    return top[key_cols + [value_col]]

maps = []
print("Building backoff maps...")
# K1: (before, prev, next)
maps.append(build_top_map(train_ctx, ['before', 'prev_before', 'next_before'], 'after', 'K1'))
# K2: (before, prev)
maps.append(build_top_map(train_ctx, ['before', 'prev_before'], 'after', 'K2'))
# K3: (before, next)
maps.append(build_top_map(train_ctx, ['before', 'next_before'], 'after', 'K3'))
# K4: (before)
maps.append(build_top_map(train_ctx, ['before'], 'after', 'K4'))
# K5: (lower(before))
train_ctx['before_lower'] = train_ctx['before'].str.lower()
maps.append(build_top_map(train_ctx, ['before_lower'], 'after', 'K5'))

print(f"Maps built in total {time()-t0:.2f}s")

print("Applying backoff to test...")
t1 = time()
pred = test_ctx.copy()

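# Stepwise fill using merges; start from an object-dtype column so string targets
# can be filled in without a float-to-object dtype change
pred['after'] = pd.Series(np.nan, index=pred.index, dtype=object)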

# K1
pred = pred.merge(maps[0].rename(columns={'after':'after_k1'}), on=['before','prev_before','next_before'], how='left')
pred['after'] = pred['after'].fillna(pred['after_k1'])
pred.drop(columns=['after_k1'], inplace=True)
print(f"After K1 filled: {pred['after'].notna().mean():.4f}")

# K2
pred = pred.merge(maps[1].rename(columns={'after':'after_k2'}), on=['before','prev_before'], how='left')
pred['after'] = pred['after'].fillna(pred['after_k2'])
pred.drop(columns=['after_k2'], inplace=True)
print(f"After K2 filled: {pred['after'].notna().mean():.4f}")

# K3
pred = pred.merge(maps[2].rename(columns={'after':'after_k3'}), on=['before','next_before'], how='left')
pred['after'] = pred['after'].fillna(pred['after_k3'])
pred.drop(columns=['after_k3'], inplace=True)
print(f"After K3 filled: {pred['after'].notna().mean():.4f}")

# K4
pred = pred.merge(maps[3].rename(columns={'after':'after_k4'}), on=['before'], how='left')
pred['after'] = pred['after'].fillna(pred['after_k4'])
pred.drop(columns=['after_k4'], inplace=True)
print(f"After K4 filled: {pred['after'].notna().mean():.4f}")

# K5 lower
pred['before_lower'] = pred['before'].str.lower()
pred = pred.merge(maps[4].rename(columns={'after':'after_k5'}), on=['before_lower'], how='left')
pred['after'] = pred['after'].fillna(pred['after_k5'])
pred.drop(columns=['after_k5','before_lower'], inplace=True)
print(f"After K5 filled: {pred['after'].notna().mean():.4f}")

# Identity fallback
miss = pred['after'].isna().sum()
if miss:
    print(f"Falling back to identity for {miss} rows")
pred['after'] = pred['after'].fillna(pred['before'])

print(f"Backoff applied in {time()-t1:.2f}s; coverage 100.00%")

print("Building submission.csv ...")
sub = pred[['sentence_id','token_id','after']].copy()
sub['id'] = sub['sentence_id'].astype(str) + '_' + sub['token_id'].astype(str)
sub = sub[['id','after']]

# Align to sample order to be safe
submission = sample[['id']].merge(sub, on='id', how='left')
na = submission['after'].isna().sum()
if na:
    print(f"Warning: {na} ids have no prediction; filling 'after' with 'before' from test")
    # Fall back to the raw test token for any unmatched id (shouldn't happen)
    tmp_fix = test_ctx.copy()
    tmp_fix['id'] = tmp_fix['sentence_id'].astype(str) + '_' + tmp_fix['token_id'].astype(str)
    submission = submission.merge(tmp_fix[['id','before']], on='id', how='left')
    submission['after'] = submission['after'].fillna(submission['before'])
    submission.drop(columns=['before'], inplace=True)

submission.to_csv('submission.csv', index=False)
print("Saved submission.csv with shape", submission.shape)

Deriving prev/next for train/test...


Context derived in 5.41s
Building backoff maps...


Built map K1 with 6626125 keys in 40.73s


Built map K2 with 4034677 keys in 21.21s


Built map K3 with 3930113 keys in 20.21s


Built map K4 with 751569 keys in 8.47s


Built map K5 with 679023 keys in 9.76s


Maps built in total 111.17s
Applying backoff to test...


After K1 filled: 0.3674


After K2 filled: 0.6564


After K3 filled: 0.8159


After K4 filled: 0.9516


After K5 filled: 0.9566
Falling back to identity for 45921 rows
Backoff applied in 19.42s; coverage 100.00%
Building submission.csv ...




Saved submission.csv with shape (1059191, 2)
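
For reference, the backoff order applied above (K1, K2, K3, K4, K5, then identity) can be reduced to a per-token dictionary lookup. This is a sketch with a hypothetical `lookup` helper and toy maps, not the merge-based code from the cell; it only shows which level answers for a given token.

```python
def lookup(token, prev_tok, next_tok, maps):
    """Return (prediction, level) from the first backoff level that knows the key."""
    keys = [
        ("K1", (token, prev_tok, next_tok)),
        ("K2", (token, prev_tok)),
        ("K3", (token, next_tok)),
        ("K4", (token,)),
        ("K5", (token.lower(),)),
    ]
    for level, key in keys:
        hit = maps.get(level, {}).get(key)
        if hit is not None:
            return hit, level
    # No level matched: fall back to the identity mapping.
    return token, "identity"


# Toy maps, invented for illustration; each level maps a key tuple to a target.
toy_maps = {
    "K2": {("на", "состоянию"): "на"},
    "K4": {("-1",): "минус один"},
    "K5": {("по",): "По"},
}

print(lookup("на", "состоянию", "1862 год", toy_maps))
print(lookup("-1", "<BOS>", "<EOS>", toy_maps))
print(lookup("xyz", "<BOS>", "<EOS>", toy_maps))
```

Tracking the level that produced each prediction would also make it easy to break validation accuracy down per backoff level during error analysis.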
