# Text Normalization Challenge (English) - Medal Plan and Experiment Log

## Goal
- Win a medal (at least Bronze; target Silver or better). Metric: accuracy.

## High-level Plan
1. Data understanding & EDA
   - Load train/test and sample submission.
   - Inspect schema, sizes, and example rows.
2. Baselines
   - Simple frequency/lookup baseline: memorize the most frequent normalized form (`after`) for each raw token (`before`).
   - Identity fallback for tokens marked `<self>` or `sil`, if present.
3. Modeling
   - Per-class models:
     - Rule-based/regex for numbers, dates, times, currencies, measurements.
     - Back off to CatBoost/LightGBM per class on engineered features if needed.
   - Sequence labeling or a pointer-generator-style fallback if the data format supports it.
4. CV Strategy
   - Split by group when a grouping key is available (e.g., `sentence_id`) so tokens from one sentence stay in the same fold; otherwise use a shuffled random split.
   - Start with small, fast iterations; log per-class accuracy.
5. Inference & Submission
   - Generate predictions on test.
   - Save to `submission.csv` in required format.
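
A minimal sketch of the lookup-then-rules idea from step 3, limited to plain integers. The names `cardinal_to_words`, `normalize_token`, and `CARDINAL_RE` are hypothetical, and the output style (lowercase words, no hyphens, no "and") is an assumption that should be checked against the `after` column before use.

```python
import re

# Assumed output style: lowercase words, no hyphens, no "and".
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = [(10**9, "billion"), (10**6, "million"), (10**3, "thousand")]
# Digits only, or digits with comma thousands separators (e.g. 1,000).
CARDINAL_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def _below_thousand(n):
    """Return the words for 0 < n < 1000 as a list (empty list for 0)."""
    parts = []
    if n >= 100:
        parts += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if n > 0:
        parts.append(ONES[n])
    return parts


def cardinal_to_words(n):
    """Spell out an integer in the range [0, 10**12)."""
    if not 0 <= n < 10**12:
        raise ValueError("only 0 <= n < 10**12 is handled in this sketch")
    if n == 0:
        return "zero"
    parts = []
    for value, name in SCALES:
        if n >= value:
            parts += _below_thousand(n // value) + [name]
            n %= value
    parts += _below_thousand(n)
    return " ".join(parts)


def normalize_token(before, lookup):
    """Lookup first, then the cardinal rule, then identity."""
    if before in lookup:
        return lookup[before]
    if CARDINAL_RE.fullmatch(before):
        n = int(before.replace(",", ""))
        if n < 10**12:
            return cardinal_to_words(n)
    return before


print(normalize_token("1,000", {}))  # falls through to the cardinal rule
print(normalize_token("News", {}))   # falls through to identity
```

The rule reads every all-digit token as a cardinal, so a year that should be read as a date (for example `1999`) would come out wrong unless the lookup or a DATE rule catches it first.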

## Experiment Log
- [T0] Setup & EDA: load zipped CSVs, inspect columns.
- [Next] Build trivial identity/majority baseline and evaluate via CV if possible.
- [Then] Add per-class rules for NUM, DATE, TIME, MONEY, MEASURE, etc.
- [Iterate] Error analysis -> add rules -> re-evaluate.
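
A rough sketch of the error-analysis step in the loop above: per-class accuracy, plus the most frequent wrong predictions for one class. It assumes a validation frame with `class`, `before`, `after`, and a prediction column named `pred`. That column name and the helpers `per_class_report` and `top_errors` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd


def per_class_report(df, pred_col="pred"):
    """Per-class row count, error count, and accuracy, worst classes first."""
    correct = (df[pred_col] == df["after"]).astype(int)
    report = (
        df.assign(correct=correct)
          .groupby("class")["correct"]
          .agg(n="size", n_correct="sum")
    )
    report["errors"] = report["n"] - report["n_correct"]
    report["acc"] = report["n_correct"] / report["n"]
    return report.sort_values("errors", ascending=False)


def top_errors(df, cls, k=5, pred_col="pred"):
    """Most frequent (before, after, pred) triples that are wrong in one class."""
    wrong = df[(df["class"] == cls) & (df[pred_col] != df["after"])]
    return (
        wrong.groupby(["before", "after", pred_col])
             .size()
             .sort_values(ascending=False)
             .head(k)
    )


# Toy validation rows (illustrative only, not taken from the dataset).
toy = pd.DataFrame({
    "class":  ["PLAIN", "PLAIN", "CARDINAL", "CARDINAL", "DATE"],
    "before": ["is", "a", "12", "7", "2006"],
    "after":  ["is", "a", "twelve", "seven", "two thousand six"],
    "pred":   ["is", "a", "twelve", "7", "2006"],
})
report = per_class_report(toy)
print(report)
print(top_errors(toy, "CARDINAL"))
```

Sorting by absolute error count rather than accuracy keeps attention on the classes that cost the most rows, which matters when a rare class has low accuracy but few tokens.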

## Notes
- Always log progress with timestamps and fold indices.
- Keep runs short; interrupt long ones and iterate.
- Ask for expert review after the plan, EDA, and baseline, and whenever the score stalls.

In [1]:
import os, sys, time, json, gc
from datetime import datetime
import pandas as pd
import numpy as np

def log(msg):
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(f"[{ts}] {msg}", flush=True)

log("Listing repository files and sizes...")
for f in sorted(os.listdir('.')):
    try:
        size_mb = f"{os.path.getsize(f)/1e6:8.2f} MB"
    except OSError:
        # Size unavailable (e.g., broken symlink); avoid printing a negative size
        size_mb = "     n/a"
    print(f" - {f:30s} {size_mb}")

log("Reading heads of CSVs (from zip) to inspect schema...")
def read_head(path, n=5):
    try:
        df = pd.read_csv(path, nrows=n)
        log(f"Loaded {path}: shape={df.shape}; columns={list(df.columns)}")
        print(df.head(3))
        return df
    except Exception as e:
        log(f"Failed to read {path}: {e}")
        return None

train_head = read_head('en_train.csv.zip', n=5)
test_head = read_head('en_test_2.csv.zip', n=5)
sub_head = read_head('en_sample_submission_2.csv.zip', n=5)

log("Quick class distribution snapshot from a small sample (if class column exists)...")
try:
    sample_rows = 200000
    df_sample = pd.read_csv('en_train.csv.zip', nrows=sample_rows)
    if 'class' in df_sample.columns:
        print(df_sample['class'].value_counts().head(20))
    else:
        print("No 'class' column in sample.")
    del df_sample; gc.collect()
except Exception as e:
    log(f"Skipping class distribution sample: {e}")

log("Done initial EDA.")

[2025-09-08 20:55:55] Listing repository files and sizes...


 - agent_metadata                     0.00 MB
 - description.md                     0.00 MB
 - docker_run.log                     0.02 MB
 - en_sample_submission_2.csv.zip    17.45 MB
 - en_test_2.csv.zip                 15.46 MB
 - en_train.csv.zip                 291.83 MB
 - main.ipynb                         0.00 MB
 - requirements.txt                   0.00 MB
 - task.txt                           0.00 MB
[2025-09-08 20:55:55] Reading heads of CSVs (from zip) to inspect schema...


[2025-09-08 20:55:55] Loaded en_train.csv.zip: shape=(5, 5); columns=['sentence_id', 'token_id', 'class', 'before', 'after']


   sentence_id  token_id  class         before          after
0            0         0  PLAIN  Brillantaisia  Brillantaisia
1            0         1  PLAIN             is             is
2            0         2  PLAIN              a              a
[2025-09-08 20:55:55] Loaded en_test_2.csv.zip: shape=(5, 3); columns=['sentence_id', 'token_id', 'before']


   sentence_id  token_id    before
0            0         0     Rocky
1            0         1  Mountain
2            0         2      News
[2025-09-08 20:55:55] Loaded en_sample_submission_2.csv.zip: shape=(5, 2); columns=['id', 'after']


    id     after
0  0_0     Rocky
1  0_1  Mountain
2  0_2      News
[2025-09-08 20:55:55] Quick class distribution snapshot from a small sample (if class column exists)...


class
PLAIN         146914
PUNCT          38555
DATE            6060
LETTERS         3047
CARDINAL        2618
VERBATIM        1645
MEASURE          300
ORDINAL          233
DECIMAL          210
MONEY             98
DIGIT             95
ELECTRONIC        90
TELEPHONE         74
TIME              29
FRACTION          26
ADDRESS            6
Name: count, dtype: int64


[2025-09-08 20:55:55] Done initial EDA.
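
One thing possibly worth checking before building lookups: with default settings, `pd.read_csv` treats strings such as `NA` and `null` as missing values. Whether this dataset contains tokens that literally read that way is an assumption to verify; the sketch below only demonstrates the parsing behavior on an in-memory CSV with made-up rows.

```python
import io
import pandas as pd

# Made-up rows: two tokens that collide with pandas' default NA markers.
csv_text = "sentence_id,token_id,before\n0,0,NA\n0,1,null\n0,2,News\n"

# Default parsing: "NA" and "null" become NaN.
default = pd.read_csv(io.StringIO(csv_text))
# keep_default_na=False keeps them as literal strings.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_missing_default = int(default["before"].isna().sum())
n_missing_literal = int(literal["before"].isna().sum())
print(n_missing_default, n_missing_literal)
print(literal["before"].tolist())
```

If the real files do contain such tokens, an identity baseline would write empty `after` values for them, so counting `isna()` on `before` after loading is a cheap sanity check.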


In [2]:
import pandas as pd
from datetime import datetime

def log(msg):
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(f"[{ts}] {msg}", flush=True)

log("Loading test set...")
test_path = 'en_test_2.csv.zip'
df_test = pd.read_csv(test_path)
log(f"Test shape: {df_test.shape}; cols={list(df_test.columns)}")

log("Building identity baseline predictions (after = before)...")
pred = df_test[['sentence_id','token_id','before']].copy()
pred['id'] = pred['sentence_id'].astype(str) + '_' + pred['token_id'].astype(str)
pred.rename(columns={'before':'after'}, inplace=True)

sub = pred[['id','after']].copy()
out_path = 'submission.csv'
sub.to_csv(out_path, index=False)
log(f"Wrote {out_path} with shape {sub.shape}")
print(sub.head(10))

[2025-09-08 20:59:01] Loading test set...


[2025-09-08 20:59:01] Test shape: (993465, 3); cols=['sentence_id', 'token_id', 'before']


[2025-09-08 20:59:01] Building identity baseline predictions (after = before)...


[2025-09-08 20:59:03] Wrote submission.csv with shape (993465, 2)


    id      after
0  0_0      Rocky
1  0_1   Mountain
2  0_2       News
3  0_3          .
4  1_0          "
5  1_1       U.S.
6  1_2  Decennial
7  1_3     Census
8  1_4          "
9  1_5          .
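
A small sketch of a pre-submission check: build the `id` column the same way as above and compare it with the sample submission. The helper names `make_submission` and `check_against_sample` are hypothetical, and the toy frames stand in for the real files.

```python
import pandas as pd


def make_submission(df_test, after):
    """Build an (id, after) frame with id = '<sentence_id>_<token_id>'."""
    ids = df_test["sentence_id"].astype(str) + "_" + df_test["token_id"].astype(str)
    # to_numpy() aligns by position, not by index label
    return pd.DataFrame({"id": ids.to_numpy(), "after": pd.Series(after).to_numpy()})


def check_against_sample(sub, sample):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append(f"columns differ: {list(sub.columns)} vs {list(sample.columns)}")
    if len(sub) != len(sample):
        problems.append(f"row count differs: {len(sub)} vs {len(sample)}")
    elif not (sub["id"].to_numpy() == sample["id"].to_numpy()).all():
        problems.append("ids differ in value or order")
    if not sub["id"].is_unique:
        problems.append("duplicate ids")
    return problems


# Toy stand-ins for en_test_2 and the sample submission.
toy_test = pd.DataFrame({
    "sentence_id": [0, 0, 1],
    "token_id": [0, 1, 0],
    "before": ["Rocky", "Mountain", '"'],
})
toy_sample = pd.DataFrame({
    "id": ["0_0", "0_1", "1_0"],
    "after": ["Rocky", "Mountain", '"'],
})
toy_sub = make_submission(toy_test, toy_test["before"])
problems = check_against_sample(toy_sub, toy_sample)
print(problems)
```

Returning a list instead of raising lets the check be logged alongside other run output and keeps the notebook cell from aborting midway.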


In [None]:
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
from sklearn.model_selection import GroupKFold
from datetime import datetime

def log(msg):
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {msg}", flush=True)

train_path = 'en_train.csv.zip'
log('Loading full train...')
df_train = pd.read_csv(train_path)
log(f"Train shape: {df_train.shape}; cols={list(df_train.columns)}")

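def build_majority_map(df):
    """Map each 'before' token to its most frequent 'after' form.

    Returns (mapping, number of train rows the mapping reproduces, number of distinct tokens).
    Note: groupby drops rows whose key is NaN, so tokens parsed as missing get no entry.
    """
    counts = df.groupby(['before','after']).size().reset_index(name='cnt')
    # idxmax keeps the first maximum, so ties go to the earliest row in sorted (before, after) order
    idx = counts.groupby('before')['cnt'].idxmax()
    best = counts.loc[idx, ['before','after','cnt']]
    mapping = dict(zip(best['before'], best['after']))
    return mapping, best['cnt'].sum(), len(mapping)

def infer_with_map(series_before, mapping):
    # Vectorized lookup; tokens unseen in training fall back to identity (after = before)
    return series_before.map(mapping).fillna(series_before)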

log('5-fold GroupKFold CV by sentence_id...')
gkf = GroupKFold(n_splits=5)
groups = df_train['sentence_id'].values
fold_acc = []
for fold, (tr_idx, va_idx) in enumerate(gkf.split(df_train, groups=groups), 1):
    t0 = datetime.now()
    dtr = df_train.iloc[tr_idx]
    dva = df_train.iloc[va_idx]
    mapping, total_cnt, uniq = build_majority_map(dtr[['before','after']])
    dva_pred = infer_with_map(dva['before'], mapping)
    acc = (dva_pred.values == dva['after'].values).mean()
    fold_acc.append(acc)
    dt = (datetime.now()-t0).total_seconds()
    log(f"Fold {fold}: acc={acc:.6f}, uniq_before={uniq}, build_cnt_sum={total_cnt}, time={dt:.2f}s")
log(f"CV mean acc: {np.mean(fold_acc):.6f} ± {np.std(fold_acc):.6f}")

# Fit on full train and generate test predictions
log('Fitting majority map on full train...')
major_map, total_cnt, uniq = build_majority_map(df_train[['before','after']])
log(f"Major map size={uniq}, total_cnt={total_cnt}")

log('Loading test and predicting...')
df_test = pd.read_csv('en_test_2.csv.zip')
pred_after = infer_with_map(df_test['before'], major_map)
sub = pd.DataFrame({
    'id': df_test['sentence_id'].astype(str) + '_' + df_test['token_id'].astype(str),
    'after': pred_after
})
sub.to_csv('submission.csv', index=False)
log(f"Wrote submission.csv with shape {sub.shape}")
print(sub.head(10))

[2025-09-08 20:59:57] Loading full train...
