# Plan to Medal: Detecting Insults in Social Commentary

Objectives:
- Build strong TF-IDF + linear model baseline with robust CV.
- Optimize for ROC-AUC using stratified k-folds.
- Iterate quickly; log progress; avoid long stalls.

Workflow:
1) Data load + quick EDA
   - Inspect columns, sizes, nulls, label balance.
   - Text field likely `Comment`; target `Insult`.

2) Validation design
   - StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
   - Use cross_val_predict or manual loop with consistent logging.

3) Baseline model
   - TfidfVectorizer:
     - word ngrams (1,2) + char ngrams (3,5) union via FeatureUnion or hstack.
     - min_df tuned (e.g., 2), max_features large (e.g., 200k total).
     - sublinear_tf=True, lowercase=True, strip_accents='unicode'.
     - stop_words='english' for word-level; none for char-level.
   - Model: LogisticRegression(solver='saga', penalty='l2', C tuned), or LinearSVC wrapped in CalibratedClassifierCV, or SGDClassifier(loss='log_loss').
   - Start with LogisticRegression (fast, probabilistic); fallback SGD if memory/time issues.

4) Preprocessing
   - Normalize URLs, mentions, numbers.
   - Keep punctuation; char-ngrams are robust to misspellings/slurs.

5) Hyperparameter tuning (targeted)
   - Small grid or random search: C in [0.5, 1, 2, 4]; min_df in [1, 2, 3]; word n-grams (1,2) vs. (1,3); char n-grams (3,5) vs. (3,6).
   - Use CV AUC to select.

6) Ensembling
   - Blend word-only and char-only logistic models (weighted average).
   - Optionally add LinearSVC calibrated; simple average.

7) Train on full data with best config; predict test; save submission.csv.

8) Logging and efficiency
   - Print fold indices, shapes, elapsed time per fold.
   - Use sparse matrices; avoid dense conversions.
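
Steps 2 and 3 can be sketched as follows. This is a minimal sketch of the FeatureUnion option named in step 3, not a tuned configuration. The helper names `build_union_pipeline` and `oof_auc` are hypothetical, and the default values are taken from the examples in the plan.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import FeatureUnion, Pipeline


def build_union_pipeline(min_df=2, C=2.0, seed=42):
    """Word (1,2) and char (3,5) TF-IDF side by side, feeding one logistic regression."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=min_df,
                                 sublinear_tf=True, strip_accents="unicode")),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5), min_df=min_df,
                                 sublinear_tf=True, strip_accents="unicode")),
    ])
    clf = LogisticRegression(solver="saga", C=C, max_iter=2000, random_state=seed)
    return Pipeline([("features", features), ("lr", clf)])


def oof_auc(texts, y, pipeline, n_splits=5, seed=42):
    """Out-of-fold probabilities and their ROC-AUC under stratified k-fold CV."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # The vectorizers are refit inside each fold because they live in the pipeline,
    # so no vocabulary or IDF statistics leak from validation rows.
    oof = cross_val_predict(pipeline, texts, y, cv=skf, method="predict_proba")[:, 1]
    return oof, roc_auc_score(y, oof)
```

Once `train_text` and `y` exist, a call would look like `oof, auc = oof_auc(train_text, y, build_union_pipeline())`. Keeping the vectorizers inside the pipeline is the design choice that matters here: fitting TF-IDF on all rows before splitting would let validation text influence the features.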

Medal Strategy:
- This competition historically favors char+word TF-IDF + linear models. Aim for ≥0.80 AUC quickly, then push with tuning/ensembling to ≥0.82.

Open Questions for Experts:
- Best char n-gram range and min_df for this dataset?
- Preference between LogisticRegression vs Calibrated LinearSVC vs SGD for top AUC here?
- Any specific pre-processing (e.g., aggressive lowercasing, de-emojify) that boosts AUC notably?
- Is ensembling two/three linear models worth it on this dataset?
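
The first two questions can also be probed empirically. Below is a hedged sketch using `GridSearchCV` over the char-level pipeline, with candidate values copied from step 5. The function name `tune_char_model` and the `CHAR_GRID` constant are hypothetical, and only the logistic-regression option is covered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Candidate values taken from step 5 of the plan.
CHAR_GRID = {
    "tfidf__ngram_range": [(3, 5), (3, 6)],
    "tfidf__min_df": [1, 2, 3],
    "lr__C": [0.5, 1, 2, 4],
}


def tune_char_model(texts, y, param_grid=None, n_splits=5, seed=42):
    """Grid-search the char TF-IDF + logistic regression pipeline by CV ROC-AUC."""
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", sublinear_tf=True,
                                  strip_accents="unicode")),
        ("lr", LogisticRegression(solver="saga", max_iter=2000, random_state=seed)),
    ])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    grid = CHAR_GRID if param_grid is None else param_grid
    search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)
    return search.fit(texts, y)  # fit() returns the fitted search object
```

After fitting, `search.best_params_` and `search.best_score_` give the selected configuration and its mean CV AUC. With only a few thousand rows, differences between neighbouring grid points may sit within fold-to-fold noise, so the per-fold spread in `search.cv_results_` is worth reading alongside the mean.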

In [1]:
# Imports, data load, and quick EDA
import os, sys, time, math, gc, random
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack, csr_matrix
import re

pd.set_option('display.max_colwidth', 200)

train_path = 'train.csv'
test_path = 'test.csv'
assert os.path.exists(train_path) and os.path.exists(test_path), 'Missing train/test files'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

print('Train shape:', train.shape)
print('Test shape:', test.shape)
print('Train columns:', train.columns.tolist())
print('Test columns:', test.columns.tolist())

# Infer key columns
target_col = 'Insult' if 'Insult' in train.columns else [c for c in train.columns if c.lower()=='insult'][0]
text_col_candidates = [c for c in train.columns if c.lower() in ['comment','comment_text','text']]
assert len(text_col_candidates) >= 1, f'Could not find text column candidates in {train.columns.tolist()}'
text_col = text_col_candidates[0]
id_col_candidates = [c for c in train.columns if c.lower() in ['id','id_str','comment_id']]
test_id_col_candidates = [c for c in test.columns if c.lower() in ['id','id_str','comment_id']]
id_col = id_col_candidates[0] if len(id_col_candidates)>0 else 'id'
if id_col not in train.columns:
    # create a synthetic id for train to keep consistent shape; test must have an id or we create one
    train[id_col] = np.arange(len(train))
if id_col not in test.columns:
    test[id_col] = np.arange(len(test))

print('Using columns -> id:', id_col, '| text:', text_col, '| target:', target_col)
print('Target distribution:')
print(train[target_col].value_counts(normalize=True))
print('Nulls in text:', train[text_col].isna().sum(), ' / ', len(train))

# Basic preprocessing utilities
url_re = re.compile(r'https?://\S+|www\.\S+')
user_re = re.compile(r'@\w+')
num_re = re.compile(r'\b\d+[\d,\.]*\b')

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ''
    s = url_re.sub(' <url> ', s)
    s = user_re.sub(' <user> ', s)
    s = num_re.sub(' <num> ', s)
    return s

# Create meta features (computed before lowercasing; stored separately) if needed later
def compute_meta_features(series: pd.Series) -> pd.DataFrame:
    texts = series.fillna('')
    lens_char = texts.str.len().astype(np.float32)
    lens_word = texts.str.split().map(len).astype(np.float32)
    caps = texts.map(lambda x: sum(1 for ch in x if ch.isupper())).astype(np.float32)
    frac_caps = (caps / (lens_char.clip(lower=1))).astype(np.float32)
    excl = texts.str.count('!').astype(np.float32)
    ques = texts.str.count(r'\?').astype(np.float32)  # raw string: '\?' is an invalid escape in a plain literal
    return pd.DataFrame({'len_char': lens_char, 'len_word': lens_word, 'frac_caps': frac_caps, 'excl': excl, 'ques': ques})

# Apply normalization to text fields
train_text = train[text_col].fillna('').map(normalize_text)
test_text = test[text_col].fillna('').map(normalize_text)

y = train[target_col].values.astype(int)
train_ids = train[id_col].values
test_ids = test[id_col].values

print('Sample normalized text:')
print(train_text.head(3).tolist())
print('Setup complete.')

Train shape: (3947, 3)
Test shape: (2647, 2)
Train columns: ['Insult', 'Date', 'Comment']
Test columns: ['Date', 'Comment']
Using columns -> id: id | text: Comment | target: Insult
Target distribution:
Insult
0    0.734229
1    0.265771
Name: proportion, dtype: float64
Nulls in text: 0  /  3947
Sample normalized text:
['"You fuck your dad."', '"i really don\'t understand your point.\\xa0 It seems that you are mixing apples and oranges."', '"A\\\\xc2\\\\xa0majority of Canadians can and has been wrong before now and will be again.\\\\n\\\\nUnless you\'re supportive of the idea that nothing is full proof or perfect so you take your chances and if we should inadvertently kill your son or daughter then them\'s the breaks and we can always regard you as collateral damage like in wartime - and sorry, but\\\\xc2\\\\xa0the cheques in the mail. "']
Setup complete.
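
The sample output above shows that some comments carry escape sequences as literal text (a backslash followed by `n`, `xa0`, or `xc2`) rather than as real whitespace. Below is a minimal sketch of one way to strip them, assuming they are noise for this task. The helper `strip_literal_escapes` is hypothetical and is not part of the notebook, and whether it helps AUC would need to be checked in CV.

```python
import re

# One or more literal backslashes followed by a hex byte (xNN),
# a unicode escape (uNNNN), or one of n / r / t.
_ESCAPE_RE = re.compile(r"\\+(?:x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|[nrt])")
_WS_RE = re.compile(r"\s+")


def strip_literal_escapes(s):
    """Replace literal escape sequences with a space, then collapse whitespace."""
    if not isinstance(s, str):
        return ""
    s = _ESCAPE_RE.sub(" ", s)
    # Collapsing also merges real newlines and tabs into single spaces.
    return _WS_RE.sub(" ", s).strip()
```

The function could be applied before `normalize_text`, for example `train[text_col].fillna('').map(strip_literal_escapes).map(normalize_text)`. One caveat: the pattern would also eat a genuine backslash followed by a word starting with `n`, `r`, or `t`, which is assumed to be rare in comment text.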


In [2]:
# 5-fold CV for word and char TF-IDF + LogisticRegression; blend OOF; train full and create submission
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd
import time

seed = 42
n_splits = 5

def build_word_pipeline():
    return Pipeline([
        ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=2, lowercase=True, strip_accents='unicode', sublinear_tf=True, max_features=100_000)),
        ('lr', LogisticRegression(solver='saga', penalty='l2', C=2.0, max_iter=2000, n_jobs=-1, random_state=seed))
    ])

def build_char_pipeline():
    return Pipeline([
        ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3,5), min_df=2, lowercase=True, strip_accents='unicode', sublinear_tf=True, max_features=200_000)),
        ('lr', LogisticRegression(solver='saga', penalty='l2', C=2.0, max_iter=2000, n_jobs=-1, random_state=seed))
    ])

def get_oof_predictions(texts, y, pipeline_builder, name):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(texts), dtype=np.float32)
    fold_aucs = []
    start_all = time.time()
    for fold, (trn_idx, val_idx) in enumerate(skf.split(texts, y), 1):
        t0 = time.time()
        X_tr = texts.iloc[trn_idx]
        y_tr = y[trn_idx]
        X_va = texts.iloc[val_idx]
        model = pipeline_builder()
        model.fit(X_tr, y_tr)
        preds = model.predict_proba(X_va)[:,1]
        oof[val_idx] = preds
        auc = roc_auc_score(y[val_idx], preds)
        fold_aucs.append(auc)
        print(f'[{name}] Fold {fold}/{n_splits} AUC: {auc:.5f} | trn {len(trn_idx)} va {len(val_idx)} | elapsed {time.time()-t0:.2f}s', flush=True)
    total_auc = roc_auc_score(y, oof)
    print(f'[{name}] OOF AUC: {total_auc:.5f} | mean {np.mean(fold_aucs):.5f} +- {np.std(fold_aucs):.5f} | total {time.time()-start_all:.2f}s', flush=True)
    return oof, fold_aucs

# Compute OOF for word and char
oof_word, aucs_word = get_oof_predictions(train_text, y, build_word_pipeline, 'WORD')
oof_char, aucs_char = get_oof_predictions(train_text, y, build_char_pipeline, 'CHAR')

# Search best blend weight
weights = np.linspace(0.0, 1.0, 101)
best_w, best_auc = 0.0, -1.0
for w in weights:
    blend = w * oof_char + (1.0 - w) * oof_word
    auc = roc_auc_score(y, blend)
    if auc > best_auc:
        best_auc, best_w = auc, w
print(f'Best blend weight (char weight): {best_w:.2f} | OOF AUC: {best_auc:.5f}', flush=True)

# Fit full models and predict test
print('Training full models on all data...', flush=True)
model_word_full = build_word_pipeline()
model_char_full = build_char_pipeline()
t0 = time.time()
model_word_full.fit(train_text, y)
print(f'Word model trained in {time.time()-t0:.2f}s', flush=True)
t1 = time.time()
model_char_full.fit(train_text, y)
print(f'Char model trained in {time.time()-t1:.2f}s', flush=True)

pred_word_test = model_word_full.predict_proba(test_text)[:,1]
pred_char_test = model_char_full.predict_proba(test_text)[:,1]
pred_blend_test = best_w * pred_char_test + (1.0 - best_w) * pred_word_test

# Create submission
sub = pd.DataFrame({
    'id': test_ids,
    'Insult': pred_blend_test
})
sub_path = 'submission.csv'
sub.to_csv(sub_path, index=False)
print('Saved submission to', sub_path, 'with shape', sub.shape, 'and head:')
print(sub.head())

[WORD] Fold 1/5 AUC: 0.88871 | trn 3157 va 790 | elapsed 0.50s
[WORD] Fold 2/5 AUC: 0.88661 | trn 3157 va 790 | elapsed 0.42s
[WORD] Fold 3/5 AUC: 0.89176 | trn 3158 va 789 | elapsed 0.36s
[WORD] Fold 4/5 AUC: 0.88418 | trn 3158 va 789 | elapsed 0.50s
[WORD] Fold 5/5 AUC: 0.88155 | trn 3158 va 789 | elapsed 0.47s
[WORD] OOF AUC: 0.88621 | mean 0.88656 +- 0.00353 | total 2.27s
[CHAR] Fold 1/5 AUC: 0.90625 | trn 3157 va 790 | elapsed 4.33s
[CHAR] Fold 2/5 AUC: 0.90645 | trn 3157 va 790 | elapsed 3.52s
[CHAR] Fold 3/5 AUC: 0.90751 | trn 3158 va 789 | elapsed 3.17s
[CHAR] Fold 4/5 AUC: 0.90651 | trn 3158 va 789 | elapsed 4.30s
[CHAR] Fold 5/5 AUC: 0.89818 | trn 3158 va 789 | elapsed 4.31s
[CHAR] OOF AUC: 0.90491 | mean 0.90498 +- 0.00343 | total 19.64s
Best blend weight (char weight): 0.86 | OOF AUC: 0.90546
Training full models on all data...
Word model trained in 0.44s
Char model trained in 4.28s
Saved submission to submission.csv with shape (2647, 2) and head:
   id    Insult
0   0  0.040317
1   1  0.122544
2   2  0.718967
3   3  0.190829
4   4  0.108523
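
The blend weight above was chosen on the same OOF predictions used to report the blended AUC, so 0.90546 is likely a slightly optimistic estimate, and the gain over the char model alone (0.90491) is small. Because ROC-AUC depends only on the ordering of scores, blending rank-normalized scores is a common alternative to blending raw probabilities. The sketch below is hedged, the helper names `rank_normalize`, `blend`, and `search_weight` are hypothetical, and nothing here has been run on the competition data.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score


def rank_normalize(p):
    """Map scores to (0, 1] by average rank; ordering (and so per-model AUC) is kept."""
    p = np.asarray(p, dtype=np.float64)
    return rankdata(p) / len(p)


def blend(p_char, p_word, w_char, use_ranks=False):
    """Weighted average of two score vectors, optionally on rank-normalized scores."""
    if use_ranks:
        p_char, p_word = rank_normalize(p_char), rank_normalize(p_word)
    return w_char * np.asarray(p_char) + (1.0 - w_char) * np.asarray(p_word)


def search_weight(y, p_char, p_word, use_ranks=False, n_steps=101):
    """Same grid search as the cell above: the char weight that maximizes AUC."""
    best_w, best_auc = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, n_steps):
        auc = roc_auc_score(y, blend(p_char, p_word, w, use_ranks))
        if auc > best_auc:
            best_w, best_auc = float(w), auc
    return best_w, best_auc
```

With the notebook's arrays this would be `search_weight(y, oof_char, oof_word, use_ranks=True)`. Ranks are computed within whichever set is passed in, so test predictions would be rank-normalized over the test rows before applying the chosen weight.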


In [3]:
# Re-save submission with correct ID column (Date) expected by autograder
import pandas as pd

test_df = pd.read_csv('test.csv')
assert 'Date' in test_df.columns, 'Expected Date column in test.csv'

# Use existing predictions in memory if available; otherwise, rebuild quickly
try:
    pred_word_test, pred_char_test, best_w
except NameError:
    # Fallback: rebuild models and predictions (should be quick on this dataset)
    model_word_full = build_word_pipeline()
    model_char_full = build_char_pipeline()
    model_word_full.fit(train_text, y)
    model_char_full.fit(train_text, y)
    pred_word_test = model_word_full.predict_proba(test_text)[:,1]
    pred_char_test = model_char_full.predict_proba(test_text)[:,1]
    # Default to char weight from earlier advice if best_w not present
    best_w = 0.65

pred_blend_test = best_w * pred_char_test + (1.0 - best_w) * pred_word_test

sub_fix = pd.DataFrame({
    'Date': test_df['Date'],
    'Insult': pred_blend_test
})
sub_fix.to_csv('submission.csv', index=False)
print('Resaved submission.csv with columns:', sub_fix.columns.tolist(), 'and shape:', sub_fix.shape)
print(sub_fix.head())

Resaved submission.csv with columns: ['Date', 'Insult'] and shape: (2647, 2)
              Date    Insult
0              NaN  0.040317
1  20120618222256Z  0.122544
2  20120618213617Z  0.718967
3              NaN  0.190829
4  20120620003825Z  0.108523
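
The output above contains `NaN` in `Date`, so `Date` cannot serve as a unique row key, and row order is what ties each prediction to its test comment. Below is a small sketch of a format check against a sample submission file. The function `check_submission` is hypothetical, and the exact requirements of the grader are an assumption.

```python
import pandas as pd


def check_submission(sub, sample, pred_col="Insult"):
    """Return a list of format problems; an empty list means the basic checks passed."""
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append(f"columns {list(sub.columns)} != expected {list(sample.columns)}")
    if len(sub) != len(sample):
        problems.append(f"row count {len(sub)} != expected {len(sample)}")
    if pred_col not in sub.columns:
        problems.append(f"missing prediction column {pred_col!r}")
    else:
        preds = sub[pred_col]
        if preds.isna().any():
            problems.append("predictions contain NaN")
        elif not preds.between(0.0, 1.0).all():
            problems.append("predictions fall outside [0, 1]")
    return problems
```

A call such as `check_submission(pd.read_csv('submission.csv'), pd.read_csv('sample_submission_null.csv'))` before submitting would surface a column mismatch without a round trip to the grader.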


In [4]:
# Re-save submission with columns matching sample_submission_null.csv: ['Insult','Date','Comment']
import pandas as pd
test_df = pd.read_csv('test.csv')
assert all(c in test_df.columns for c in ['Date','Comment'])

try:
    pred_blend_test
except NameError:
    # Fallback safety: rebuild predictions
    model_word_full = build_word_pipeline()
    model_char_full = build_char_pipeline()
    model_word_full.fit(train_text, y)
    model_char_full.fit(train_text, y)
    pred_word_test = model_word_full.predict_proba(test_text)[:,1]
    pred_char_test = model_char_full.predict_proba(test_text)[:,1]
    best_w = 0.65 if 'best_w' not in globals() else best_w
    pred_blend_test = best_w * pred_char_test + (1.0 - best_w) * pred_word_test

sub3 = pd.DataFrame({
    'Insult': pred_blend_test,
    'Date': test_df['Date'],
    'Comment': test_df['Comment']
})
sub3.to_csv('submission.csv', index=False)
print('submission.csv columns:', sub3.columns.tolist(), 'shape:', sub3.shape)
print(sub3.head(3))

submission.csv columns: ['Insult', 'Date', 'Comment'] shape: (2647, 3)
     Insult             Date  \
0  0.040317              NaN   
1  0.122544  20120618222256Z   
2  0.718967  20120618213617Z   

                                                                                                                                                                                                   Comment  
0  "THE DRUDGE REPORT\\n\\n\\n\\nYou won't see this story on foxfag forum because they suck bIacks and gay 0bama all the way to the crack.\\n\\n\\n\\n\\n\\n\\n\\nOn Tuesday Rep. Darrell Issa, chairma...  
1                        "@ian21\xa0"Roger Clemens is the fucking man, and never did any fucking steroids because he is fucking awesome. Did you all misremember Roger's incredibleness?" - Roger Clemens"  
2                                                                                          "Agree with Alan you are an extremest idiot.   You are not American, we native americans are, 