# Quick TF-IDF + Ridge Baseline for Patent Phrase Matching

## Objective
Implement a fast baseline that uses TF-IDF features on the anchor, target, and context texts plus Ridge regression, and generate submission.csv. The target is an OOF Pearson of ~0.83-0.85 (above median). This unblocks a leaderboard submission while the GPU/CPU environment fixes proceed in parallel.

## Strategy
- **Features:** TF-IDF on the combined 'anchor + context' and 'target + context' texts (separate vectorizers, outputs concatenated), ngram_range=(1,3), max_features=5000 per vectorizer.
- **Model:** Ridge regression (alpha=1.0), refit on each fold.
- **CV:** 5-fold GroupKFold grouped by 'anchor' to prevent leakage.
- **Evaluation:** OOF Pearson correlation overall.
- **Inference:** Average predictions across folds for test set.
- **Output:** submission.csv with 'id' and 'score' columns.
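
As a minimal sketch of the feature and CV choices listed above, the snippet below builds the two TF-IDF blocks on a handful of made-up rows and checks that GroupKFold keeps every anchor on one side of each split. The toy phrases, the variable names (`rows`, `vec_anchor`, `vec_target`), and `n_splits=3` are assumptions for illustration only; they are not taken from the competition data.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupKFold

# Toy rows (hypothetical): anchor, target, CPC-style context code.
rows = [
    ("abatement", "abatement of pollution", "A47"),
    ("abatement", "forest region", "A47"),
    ("acid absorption", "absorption of acid", "B01"),
    ("acid absorption", "acid immersion", "B01"),
    ("bearing ring", "ring bearing", "F16"),
    ("bearing ring", "metal ring", "F16"),
]
anchors = np.array([r[0] for r in rows])
anchor_ctx = [f"{a} {c}" for a, _, c in rows]  # anchor + context
target_ctx = [f"{t} {c}" for _, t, c in rows]  # target + context

# Separate vectorizers, same settings as in the strategy list.
vec_anchor = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
vec_target = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
X = hstack([vec_anchor.fit_transform(anchor_ctx),
            vec_target.fit_transform(target_ctx)]).tocsr()

# Grouping by anchor: count anchors shared between train and validation.
gkf = GroupKFold(n_splits=3)
overlaps = []
for tr_idx, val_idx in gkf.split(X, groups=anchors):
    overlaps.append(len(set(anchors[tr_idx]) & set(anchors[val_idx])))

print(X.shape[0], overlaps)  # 6 [0, 0, 0]
```

Every overlap count is zero, which is the property the anchor grouping is meant to guarantee: the model is always validated on anchors it has not seen during training.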

## Expected Performance
Expected OOF Pearson: ~0.83-0.85. Submit if OOF >= 0.83 to secure an above-median position. Later, replace this baseline with a DeBERTa cross-encoder for the medal push (>= 0.85 OOF).
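
Because the threshold above is stated in terms of Pearson correlation, it may help to recall what that metric does and does not reward. The sketch below uses made-up scores (`y_true`, `oof`) to show that Pearson is unchanged by a positive linear rescaling of the predictions, so it measures linear agreement rather than calibration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical ground-truth scores and out-of-fold predictions.
y_true = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
oof = np.array([0.1, 0.2, 0.6, 0.7, 0.9])

r = pearsonr(oof, y_true)[0]

# Rescaling and shifting the predictions leaves Pearson unchanged.
r_scaled = pearsonr(0.5 * oof + 0.2, y_true)[0]

print(np.isclose(r, r_scaled))  # True
```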

## Workflow
1. Imports (pandas, numpy, sklearn, scipy).
2. Load data, prepare texts (anchor_context = anchor + ' ' + context, target_context = target + ' ' + context).
3. Fit TF-IDF vectorizers on train, transform train/test.
4. CV loop: GroupKFold, train Ridge on fold, predict OOF, compute fold/overall Pearson.
5. Test predictions: average fold preds, clip [0,1], save submission.csv.
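
Step 5 can be illustrated in isolation. Ridge is an unbounded linear model, so its averaged predictions can fall slightly outside the [0, 1] score range, which is why the clip is applied. The fold predictions below are invented numbers, and `fold_preds` is a hypothetical name used only for this sketch.

```python
import numpy as np

# Hypothetical predictions from 3 folds for 4 test rows (shape: folds x rows).
fold_preds = np.array([
    [0.10, 0.55, 1.08, -0.04],
    [0.20, 0.45, 1.02, -0.02],
    [0.30, 0.50, 0.96,  0.00],
])

avg = fold_preds.mean(axis=0)   # average over folds, one value per test row
final = np.clip(avg, 0.0, 1.0)  # force predictions into the valid score range

print(final)
```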

Next: Insert code cell for implementation and execution.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold
from scipy.stats import pearsonr
from scipy.sparse import hstack

# Load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(f'Train shape: {train.shape}, Test shape: {test.shape}')

# Prepare texts: combine anchor + context, target + context
train['anchor_context'] = train['anchor'] + ' ' + train['context']
train['target_context'] = train['target'] + ' ' + train['context']
test['anchor_context'] = test['anchor'] + ' ' + test['context']
test['target_context'] = test['target'] + ' ' + test['context']

# TF-IDF vectorizers
tfidf_anchor = TfidfVectorizer(ngram_range=(1,3), max_features=5000)
tfidf_target = TfidfVectorizer(ngram_range=(1,3), max_features=5000)

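# Fit on the full training set, then transform.
# Note: the vocabulary and IDF statistics therefore also see validation-fold rows;
# this keeps the baseline simple at the cost of a mild leak into the OOF score.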
X_anchor_train = tfidf_anchor.fit_transform(train['anchor_context'])
X_target_train = tfidf_target.fit_transform(train['target_context'])
X_train = hstack([X_anchor_train, X_target_train]).tocsr()

X_anchor_test = tfidf_anchor.transform(test['anchor_context'])
X_target_test = tfidf_target.transform(test['target_context'])
X_test = hstack([X_anchor_test, X_target_test]).tocsr()

y_train = train['score'].values

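# CV setup: group by anchor so no anchor appears in both train and validation
gkf = GroupKFold(n_splits=5)
oof = np.zeros(len(train))             # out-of-fold predictions
test_preds = np.zeros((5, len(test)))  # one row of test predictions per fold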

for fold, (tr_idx, val_idx) in enumerate(gkf.split(X_train, y_train, groups=train['anchor'])):
    print(f'Training fold {fold+1}/5...')
    X_tr = X_train[tr_idx]
    X_val = X_train[val_idx]
    y_tr, y_val = y_train[tr_idx], y_train[val_idx]
    
    model = Ridge(alpha=1.0)
    model.fit(X_tr, y_tr)
    
    oof_val = model.predict(X_val)
    oof[val_idx] = oof_val
    test_preds[fold] = model.predict(X_test)
    
    fold_pearson = pearsonr(oof_val, y_val)[0]
    print(f'Fold {fold+1} Pearson: {fold_pearson:.4f}')

# Overall OOF
oof_pearson = pearsonr(oof, y_train)[0]
print(f'OOF Pearson: {oof_pearson:.4f}')

# Test predictions: average across folds, then clip to the valid score range [0, 1]
test['score'] = np.clip(np.mean(test_preds, axis=0), 0, 1)

# Save submission
submission = test[['id', 'score']]
submission.to_csv('submission.csv', index=False)
print('Submission saved to submission.csv')
print(submission.head())

Train shape: (32825, 5), Test shape: (3648, 4)
Training fold 1/5...
Fold 1 Pearson: 0.2530
Training fold 2/5...
Fold 2 Pearson: 0.2537
Training fold 3/5...
Fold 3 Pearson: 0.2489
Training fold 4/5...
Fold 4 Pearson: 0.2711
Training fold 5/5...
Fold 5 Pearson: 0.2283
OOF Pearson: 0.2509
Submission saved to submission.csv
                 id     score
0  2a988c7d98568627  0.291179
1  75a3ae03b26e2f7e  0.236453
2  0126c870aede9858  0.419978
3  2cf662e1cc9b354e  0.144750
4  8dfee5874de0b408  0.298207
