# KDD Cup 2022 ESCI Challenge - Data Preprocessing

This notebook preprocesses the Shopping Queries Dataset and is designed to fit into a comprehensive ML pipeline structure.

## 📋 Project Structure Integration

This notebook is part of the following project structure:

```
kddcup/
├── data/
│   ├── raw/                     # ✅ Raw dataset files
│   ├── processed/               # ✅ Clean data for each task  
│   └── features/                # ✅ Generated features
├── src/
│   ├── config/config.py         # ✅ Configuration settings
│   ├── data/data_loader.py      # ⚡ Data loading pipeline
│   ├── features/base_features.py # ⚡ Feature engineering
│   ├── models/lgb_ranker.py     # ⚡ LightGBM models
│   └── evaluation/metrics.py    # ⚡ Evaluation metrics
├── notebooks/                   # 📓 Current location
├── experiments/                 # 🧪 Experiment tracking
└── results/                     # 📊 Model outputs
```

## 🎯 Preprocessing Pipeline

This notebook covers the following steps:

1. **Data Loading & Merging**: Load the data as described in the KDD Cup specification
2. **Data Cleaning**: Missing values, duplicates, and validation
3. **Feature Engineering**: Basic text features and similarity metrics
4. **Task-Specific Datasets**: Prepare a dataset for each of the 3 tasks
5. **LightGBM Baseline Model**: First model training and evaluation
6. **Quality Checks**: Data validation and integrity checks
7. **Data Export**: Save the processed data according to the project structure

## 📊 Development Phase

**Current Phase**: Data Preprocessing + Baseline Model (Target: ~0.2-0.3 NDCG)

**Features Implemented**:
- Basic text features (length, word count, unique words)
- Query-product similarity (word overlap, Jaccard similarity)
- Brand/color matching features
- Target encoding (ESCI score mapping)
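
The target encoding listed above turns each ESCI label into a numeric relevance grade. The real values come from `Config.ESCI_MAPPING`, which is not shown in this notebook, so the mapping below is an assumption. It uses integer grades because LightGBM's `lambdarank` objective expects integer relevance labels. `ESCI_MAPPING_ASSUMED` and `encode_esci` are hypothetical names used only for illustration.

```python
# Hypothetical mapping; the notebook reads the real one from Config.ESCI_MAPPING.
ESCI_MAPPING_ASSUMED = {"E": 3, "S": 2, "C": 1, "I": 0}


def encode_esci(labels, mapping=ESCI_MAPPING_ASSUMED):
    """Map ESCI labels to integer relevance grades, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Unknown ESCI labels: {unknown}")
    return [mapping[label] for label in labels]


scores = encode_esci(["E", "I", "S", "C", "E"])
print(scores)
```

Raising on unknown labels avoids the silent `NaN` values that `Series.map` would produce for a label missing from the mapping.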

**Next Phases**:
- Phase 1: Advanced text features (TF-IDF, embeddings) → Target: 0.4-0.5 NDCG
- Phase 2: Deep learning features + ensemble methods → Target: 0.6+ NDCG

## 🎯 Outputs

After this notebook runs, the following files are created:

**Processed Data**:
- `data/processed/task_1/` - Query-Product Ranking dataset
- `data/processed/task_2/` - Multi-class Classification dataset  
- `data/processed/task_3/` - Substitute Identification dataset

**Features**:
- `data/processed/feature_metadata.pkl` - Feature definitions
- `data/processed/dataset_summary.csv` - Dataset statistics

**Model Results**:
- LightGBM baseline model performance
- Feature importance analysis
- NDCG evaluation metrics

## 1. Import Required Libraries

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Text processing libraries
import re
import string
from collections import Counter
import warnings

# System and file operations
import os
import sys
from pathlib import Path
import pickle

# Add src to path for config import
sys.path.append('../')
from src.config.config import Config

# Configure settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries loaded successfully!")
print(f"📁 Base directory: {Config.BASE_DIR}")
print(f"📊 Raw data directory: {Config.RAW_DATA_DIR}")
print(f"💾 Processed data directory: {Config.PROCESSED_DATA_DIR}")

## 2. Load Configuration and Setup Paths

In [None]:
# Create necessary directories
Config.create_directories()

# Define file paths
RAW_DATA_PATH = Config.RAW_DATA_DIR
PROCESSED_DATA_PATH = Config.PROCESSED_DATA_DIR

# Check if raw data files exist
required_files = {
    'examples': Config.EXAMPLES_FILE,
    'products': Config.PRODUCTS_FILE,
    'sources': Config.SOURCES_FILE
}

print("📋 Data file check:")
print("=" * 40)

missing_files = []
for name, file_path in required_files.items():
    if file_path.exists():
        file_size = file_path.stat().st_size / (1024*1024)  # MB
        print(f"✅ {name:10}: {file_path.name} ({file_size:.1f} MB)")
    else:
        print(f"❌ {name:10}: {file_path.name} - MISSING!")
        missing_files.append(file_path)

if missing_files:
    print(f"\n⚠️  {len(missing_files)} file(s) missing!")
    print("Please place the following files in data/raw/:")
    for file_path in missing_files:
        print(f"   - {file_path.name}")
    print("\nYou need to download the KDD Cup 2022 dataset and place it there.")
else:
    print(f"\n🎉 All data files are present!")

# Create preprocessing functions
def print_section(title):
    """Helper function to print section headers"""
    print(f"\n{'='*60}")
    print(f"📊 {title}")
    print(f"{'='*60}")

def print_dataframe_info(df, name):
    """Helper function to print dataframe info"""
    print(f"\n{name} Dataset Info:")
    print(f"  Shape: {df.shape}")
    print(f"  Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
    print(f"  Columns: {list(df.columns)}")

print("\n✅ Configuration ready!")
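
The cells above rely on a `Config` class from `src/config/config.py` that is not included in this notebook. A minimal sketch of what such a class could look like is given below. The attribute names match the ones used in the notebook. The file names, the `"us"` locale value, the integer `ESCI_MAPPING`, and the `FEATURES_DIR` attribute are assumptions based on the public Shopping Queries Dataset release and on the project tree, not the project's actual configuration.

```python
from pathlib import Path


class Config:
    """Hypothetical stand-in for src/config/config.py (all values are assumptions)."""

    BASE_DIR = Path("kddcup")
    RAW_DATA_DIR = BASE_DIR / "data" / "raw"
    PROCESSED_DATA_DIR = BASE_DIR / "data" / "processed"
    FEATURES_DIR = BASE_DIR / "data" / "features"  # hypothetical attribute

    # Assumed file names from the public dataset release.
    EXAMPLES_FILE = RAW_DATA_DIR / "shopping_queries_dataset_examples.parquet"
    PRODUCTS_FILE = RAW_DATA_DIR / "shopping_queries_dataset_products.parquet"
    SOURCES_FILE = RAW_DATA_DIR / "shopping_queries_dataset_sources.csv"

    LANGUAGE = "us"  # assumed product_locale value for English
    ESCI_MAPPING = {"E": 3, "S": 2, "C": 1, "I": 0}  # assumed integer grades

    @classmethod
    def create_directories(cls):
        """Create the data directories if they do not exist yet."""
        for directory in (cls.RAW_DATA_DIR, cls.PROCESSED_DATA_DIR, cls.FEATURES_DIR):
            directory.mkdir(parents=True, exist_ok=True)
```

Keeping every path as a `pathlib.Path` is what makes calls such as `file_path.exists()` and `file_path.stat()` in the check above work without string handling.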

## 3. Load Raw Data Files

In [None]:
print_section("RAW DATA LOADING")

# Load examples, products and sources - exactly as specified in KDD Cup documentation
print("📥 Loading datasets using official KDD Cup approach...")

df_examples = pd.read_parquet(Config.EXAMPLES_FILE)
df_products = pd.read_parquet(Config.PRODUCTS_FILE)
df_sources = pd.read_csv(Config.SOURCES_FILE)

print_dataframe_info(df_examples, "Examples")
print_dataframe_info(df_products, "Products")
print_dataframe_info(df_sources, "Sources")

print(f"\n✅ All data files loaded successfully!")
print(f"   📊 Examples: {len(df_examples):,} rows")
print(f"   🛍️  Products: {len(df_products):,} rows")
print(f"   🔍 Sources: {len(df_sources):,} rows")

# Merge examples with products - exactly as specified
print("\n🔗 Merging examples with products...")
df_examples_products = pd.merge( 
    df_examples, 
    df_products, 
    how='left', 
    left_on=['product_locale','product_id'], 
    right_on=['product_locale', 'product_id']
)

print(f"Merged dataset shape: {df_examples_products.shape}")

# Check merge success
merge_success = df_examples_products['product_title'].notna().sum()
merge_total = len(df_examples_products)
print(f"Merge success rate: {merge_success/merge_total*100:.1f}% ({merge_success}/{merge_total})")
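
The merge-success check above counts non-null titles, which also counts products whose title is genuinely missing as failed merges. As an alternative sketch on toy data (the frames below are made up for illustration), `pandas.merge` can report unmatched rows directly through `indicator=True`, and `validate="many_to_one"` guards against duplicate product keys inflating the row count.

```python
import pandas as pd

# Toy stand-ins for df_examples and df_products (hypothetical values).
examples = pd.DataFrame({
    "query_id": [0, 0, 1],
    "product_id": ["A1", "B2", "Z9"],
    "product_locale": ["us", "us", "us"],
})
products = pd.DataFrame({
    "product_id": ["A1", "B2"],
    "product_locale": ["us", "us"],
    "product_title": ["red mug", "blue mug"],
})

merged = examples.merge(
    products,
    how="left",
    on=["product_locale", "product_id"],
    validate="many_to_one",  # raises MergeError if products has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} examples have no matching product row")
```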

## 4. Explore Data Structure and Statistics

In [None]:
print_section("DATA STRUCTURE EXPLORATION")

# Examples dataset analysis
print("📊 EXAMPLES DATASET")
print("-" * 30)
print(f"Shape: {df_examples.shape}")
print(f"Columns: {list(df_examples.columns)}")
print("\nFirst 3 rows:")
display(df_examples.head(3))

print("\nData types:")
print(df_examples.dtypes)

print("\nMissing values:")
missing_examples = df_examples.isnull().sum()
print(missing_examples[missing_examples > 0])

print("\nUnique values per column:")
for col in df_examples.columns:
    unique_count = df_examples[col].nunique()
    print(f"  {col}: {unique_count:,}")

# ESCI label distribution
print("\n📈 ESCI Label Distribution:")
esci_dist = df_examples['esci_label'].value_counts().sort_index()
print(esci_dist)

# Version and split distribution  
print("\n📋 Version Distribution:")
print("Small version:", df_examples['small_version'].sum())
print("Large version:", df_examples['large_version'].sum())

print("\n📋 Split Distribution:")
print(df_examples['split'].value_counts())

# Product locale distribution
print("\n🌍 Product Locale Distribution:")
locale_dist = df_examples['product_locale'].value_counts()
print(locale_dist)

In [None]:
# Products dataset analysis
print("\n\n🛍️ PRODUCTS DATASET")
print("-" * 30)
print(f"Shape: {df_products.shape}")
print(f"Columns: {list(df_products.columns)}")

print("\nFirst 3 rows:")
display(df_products.head(3))

print("\nMissing values:")
missing_products = df_products.isnull().sum()
print(missing_products[missing_products > 0])

print("\nUnique values per column:")
for col in df_products.columns:
    unique_count = df_products[col].nunique()
    print(f"  {col}: {unique_count:,}")

# Text field statistics
text_fields = ['product_title', 'product_description', 'product_bullet_point']
print(f"\n📝 Text Field Statistics:")
for field in text_fields:
    if field in df_products.columns:
        non_null = df_products[field].notna().sum()
        avg_length = df_products[field].str.len().mean()
        print(f"  {field}:")
        print(f"    Non-null: {non_null:,} ({non_null/len(df_products)*100:.1f}%)")
        print(f"    Avg length: {avg_length:.1f} chars")

# Sources dataset analysis  
print("\n\n🔍 SOURCES DATASET")
print("-" * 30)
print(f"Shape: {df_sources.shape}")
print(f"Columns: {list(df_sources.columns)}")

print("\nFirst 5 rows:")
display(df_sources.head())

print("\nSource distribution:")
if 'source' in df_sources.columns:
    source_dist = df_sources['source'].value_counts()
    print(source_dist)

## 5. Clean and Validate Data

In [None]:
print_section("DATA CLEANING AND VALIDATION")

def clean_text(text):
    """Clean text data"""
    if pd.isna(text):
        return ""
    
    # Convert to string and strip
    text = str(text).strip()
    
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text)
    
    return text

def validate_esci_labels(df):
    """Validate ESCI labels"""
    valid_labels = {'E', 'S', 'C', 'I'}
    invalid_labels = set(df['esci_label'].unique()) - valid_labels
    
    if invalid_labels:
        print(f"⚠️  Invalid ESCI labels found: {invalid_labels}")
        return False
    else:
        print("✅ All ESCI labels are valid")
        return True

# Clean examples dataset
print("🧹 Cleaning examples dataset...")
df_examples_clean = df_examples.copy()

# Check for duplicates
duplicates = df_examples_clean.duplicated().sum()
print(f"Duplicates found: {duplicates}")

if duplicates > 0:
    df_examples_clean = df_examples_clean.drop_duplicates()
    print(f"✅ {duplicates} duplicate removed")

# Validate ESCI labels
validate_esci_labels(df_examples_clean)

# Clean query text
df_examples_clean['query'] = df_examples_clean['query'].apply(clean_text)

# Remove empty queries
empty_queries = df_examples_clean['query'].str.len() == 0
if empty_queries.sum() > 0:
    print(f"⚠️  {empty_queries.sum()} empty queries found, removing...")
    df_examples_clean = df_examples_clean[~empty_queries]

print(f"Examples dataset: {len(df_examples)} → {len(df_examples_clean)} rows")

# Clean products dataset
print("\n🧹 Cleaning products dataset...")
df_products_clean = df_products.copy()

# Check for duplicates  
duplicates = df_products_clean.duplicated(subset=['product_id', 'product_locale']).sum()
print(f"Product duplicates found: {duplicates}")

if duplicates > 0:
    df_products_clean = df_products_clean.drop_duplicates(subset=['product_id', 'product_locale'])
    print(f"✅ {duplicates} product duplicates removed")

# Clean text fields
text_fields = ['product_title', 'product_description', 'product_bullet_point', 'product_brand']
for field in text_fields:
    if field in df_products_clean.columns:
        df_products_clean[field] = df_products_clean[field].apply(clean_text)

# Fill missing text fields with empty string
df_products_clean[text_fields] = df_products_clean[text_fields].fillna("")

print(f"Products dataset: {len(df_products)} → {len(df_products_clean)} rows")

# Clean sources dataset
print("\n🧹 Cleaning sources dataset...")
df_sources_clean = df_sources.copy()

# Check for duplicates
duplicates = df_sources_clean.duplicated(subset=['query_id']).sum()
print(f"Source duplicates found: {duplicates}")

if duplicates > 0:
    df_sources_clean = df_sources_clean.drop_duplicates(subset=['query_id'])
    print(f"✅ {duplicates} source duplicates removed")

print(f"Sources dataset: {len(df_sources)} → {len(df_sources_clean)} rows")

print("\n✅ Data cleaning complete!")

## 6. Feature Engineering for Text Data

In [None]:
print_section("FEATURE ENGINEERING")

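# Merge the cleaned examples and products so that features are computed on cleaned text
print("⚙️ Merging cleaned examples and products for feature engineering...")
df_master = pd.merge(
    df_examples_clean,
    df_products_clean,
    how='left',
    on=['product_locale', 'product_id']
)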

print(f"Master dataset shape: {df_master.shape}")

def create_basic_text_features(df):
    """Create basic text features"""
    features = df.copy()
    
    # Query features
    features['query_len'] = features['query'].str.len()
    features['query_word_count'] = features['query'].str.split().str.len()
    features['query_unique_words'] = features['query'].apply(lambda x: len(set(str(x).lower().split())))
    
    # Product title features
    features['title_len'] = features['product_title'].str.len()
    features['title_word_count'] = features['product_title'].str.split().str.len()
    features['title_unique_words'] = features['product_title'].apply(lambda x: len(set(str(x).lower().split())))
    
    # Product description features
    features['description_len'] = features['product_description'].str.len()
    features['description_word_count'] = features['product_description'].str.split().str.len()
    
    # Brand and color features
    features['has_brand'] = (features['product_brand'].str.len() > 0).astype(int)
    features['has_color'] = (features['product_color'].str.len() > 0).astype(int)
    
    return features

def create_similarity_features(df):
    """Create query-product similarity features"""
    features = df.copy()
    
    # Exact matches
    features['query_in_title'] = features.apply(
        lambda x: 1 if str(x['query']).lower() in str(x['product_title']).lower() else 0, axis=1
    )
    
    features['title_in_query'] = features.apply(
        lambda x: 1 if str(x['product_title']).lower() in str(x['query']).lower() else 0, axis=1
    )
    
    # Word overlap features
    def word_overlap_ratio(text1, text2):
        """Fraction of the words in text1 that also appear in text2."""
        words1 = set(str(text1).lower().split())
        words2 = set(str(text2).lower().split())
        if len(words1) == 0 or len(words2) == 0:
            return 0
        intersection = len(words1.intersection(words2))
        # Divide by the size of text1's word set; dividing by the union would
        # duplicate word_jaccard_similarity below
        return intersection / len(words1)
    
    def word_jaccard_similarity(text1, text2):
        words1 = set(str(text1).lower().split())
        words2 = set(str(text2).lower().split())
        if len(words1) == 0 and len(words2) == 0:
            return 1
        if len(words1) == 0 or len(words2) == 0:
            return 0
        intersection = len(words1.intersection(words2))
        union = len(words1.union(words2))
        return intersection / union
    
    print("📊 Computing similarity features...")
    features['query_title_word_overlap'] = features.apply(
        lambda x: word_overlap_ratio(x['query'], x['product_title']), axis=1
    )
    
    features['query_title_jaccard'] = features.apply(
        lambda x: word_jaccard_similarity(x['query'], x['product_title']), axis=1
    )
    
    # Brand matching
    features['brand_in_query'] = features.apply(
        lambda x: 1 if str(x['product_brand']).lower() in str(x['query']).lower() and len(str(x['product_brand'])) > 0 else 0, axis=1
    )
    
    return features

# Create features
print("⚙️ Creating basic text features...")
df_master_features = create_basic_text_features(df_master)

print("⚙️ Creating similarity features...")
df_master_features = create_similarity_features(df_master_features)

# Add ESCI label encoding
esci_mapping = Config.ESCI_MAPPING
df_master_features['esci_score'] = df_master_features['esci_label'].map(esci_mapping)

print(f"\n✅ Feature engineering complete!")
print(f"Final dataset shape: {df_master_features.shape}")

# Show feature summary
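# Note: '_in_' sits in the middle of a name and 'has_' at the start, so they
# cannot be matched with endswith()
feature_suffixes = ('_len', '_count', '_words', '_overlap', '_jaccard')
feature_cols = [
    col for col in df_master_features.columns
    if col.endswith(feature_suffixes) or '_in_' in col or col.startswith('has_')
]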
print(f"Created {len(feature_cols)} features:")
for i, col in enumerate(feature_cols):
    if i % 3 == 0:
        print()
    print(f"  {col:25}", end="")
print()
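
To make the similarity features above concrete, here is a small self-contained sketch contrasting Jaccard similarity with a query-coverage ratio on a single query/title pair. The helper names `tokenize`, `jaccard`, and `query_coverage` are hypothetical and exist only in this example. They use the same lowercase whitespace tokenization as the notebook.

```python
def tokenize(text):
    """Lowercase and split on whitespace, returning the set of distinct words."""
    return set(str(text).lower().split())


def jaccard(query, title):
    """|Q ∩ T| / |Q ∪ T|; long titles lower the score even when every query word matches."""
    q, t = tokenize(query), tokenize(title)
    if not q or not t:
        return 0.0
    return len(q & t) / len(q | t)


def query_coverage(query, title):
    """|Q ∩ T| / |Q|; the share of query words that appear in the title."""
    q, t = tokenize(query), tokenize(title)
    if not q:
        return 0.0
    return len(q & t) / len(q)


query = "red coffee mug"
title = "Large Red Ceramic Coffee Mug with Handle"
print(f"jaccard={jaccard(query, title):.3f}, coverage={query_coverage(query, title):.3f}")
```

Product titles are usually much longer than queries, so the two measures can diverge sharply. That is one reason to keep both as separate features rather than two copies of the same ratio.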

## 7. Create Task-Specific Datasets

In [None]:
print_section("TASK-SPECIFIC DATASETS")

# Filter for English locale only
print("🌍 Filtering for English locale (US)...")
df_english = df_master_features[df_master_features['product_locale'] == Config.LANGUAGE].copy()
print(f"English dataset shape: {df_english.shape}")

# Task 1: Query-Product Ranking (small version)
print("\n🎯 Task 1: Query-Product Ranking")
print("-" * 40)
df_task1 = df_english[df_english['small_version'] == 1].copy()

df_task1_train = df_task1[df_task1['split'] == 'train'].copy()
df_task1_test = df_task1[df_task1['split'] == 'test'].copy()

print(f"Task 1 Total: {len(df_task1):,} examples")
print(f"  Train: {len(df_task1_train):,} examples")
print(f"  Test:  {len(df_task1_test):,} examples")
print(f"  Unique queries: {df_task1['query_id'].nunique():,}")

# ESCI distribution for Task 1
task1_esci = df_task1['esci_label'].value_counts().sort_index()
print("  ESCI distribution:")
for label, count in task1_esci.items():
    print(f"    {label}: {count:,} ({count/len(df_task1)*100:.1f}%)")

# Task 2: Multi-class Product Classification (large version)
print("\n🎯 Task 2: Multi-class Product Classification")
print("-" * 50)
df_task2 = df_english[df_english['large_version'] == 1].copy()

df_task2_train = df_task2[df_task2['split'] == 'train'].copy()
df_task2_test = df_task2[df_task2['split'] == 'test'].copy()

print(f"Task 2 Total: {len(df_task2):,} examples")
print(f"  Train: {len(df_task2_train):,} examples")
print(f"  Test:  {len(df_task2_test):,} examples")
print(f"  Unique queries: {df_task2['query_id'].nunique():,}")

# ESCI distribution for Task 2
task2_esci = df_task2['esci_label'].value_counts().sort_index()
print("  ESCI distribution:")
for label, count in task2_esci.items():
    print(f"    {label}: {count:,} ({count/len(df_task2)*100:.1f}%)")

# Task 3: Product Substitute Identification (large version)
print("\n🎯 Task 3: Product Substitute Identification")
print("-" * 45)
df_task3 = df_english[df_english['large_version'] == 1].copy()

# Create substitute label (1 if S, 0 otherwise)
df_task3['substitute_label'] = (df_task3['esci_label'] == 'S').astype(int)

df_task3_train = df_task3[df_task3['split'] == 'train'].copy()
df_task3_test = df_task3[df_task3['split'] == 'test'].copy()

print(f"Task 3 Total: {len(df_task3):,} examples")
print(f"  Train: {len(df_task3_train):,} examples")
print(f"  Test:  {len(df_task3_test):,} examples")
print(f"  Unique queries: {df_task3['query_id'].nunique():,}")

# Substitute distribution for Task 3
substitute_dist = df_task3['substitute_label'].value_counts()
print("  Substitute distribution:")
print(f"    Non-Substitute (0): {substitute_dist[0]:,} ({substitute_dist[0]/len(df_task3)*100:.1f}%)")
print(f"    Substitute (1):     {substitute_dist[1]:,} ({substitute_dist[1]/len(df_task3)*100:.1f}%)")

# Create task datasets dictionary for easy access
task_datasets = {
    'task1': {
        'train': df_task1_train,
        'test': df_task1_test,
        'full': df_task1,
        'target_col': 'esci_score',
        'description': 'Query-Product Ranking (Small Version)'
    },
    'task2': {
        'train': df_task2_train,
        'test': df_task2_test,
        'full': df_task2,
        'target_col': 'esci_label',
        'description': 'Multi-class Product Classification (Large Version)'
    },
    'task3': {
        'train': df_task3_train,
        'test': df_task3_test,
        'full': df_task3,
        'target_col': 'substitute_label',
        'description': 'Product Substitute Identification (Large Version)'
    }
}

print(f"\n✅ Task-specific datasets ready!")
print(f"   📊 Task 1 (Ranking): {len(df_task1):,} examples")
print(f"   🎯 Task 2 (Classification): {len(df_task2):,} examples")
print(f"   🔍 Task 3 (Substitute): {len(df_task3):,} examples")

In [None]:
# KDD Cup Official Task Preparation (exactly as specified)
print_section("KDD CUP OFFICIAL TASK PREPARATION")

# Filter and prepare for Task 1 - exactly as specified
print("🎯 Filter and prepare for Task 1")
df_task_1 = df_examples_products[df_examples_products["small_version"] == 1]
df_task_1_train = df_task_1[df_task_1["split"] == "train"]
df_task_1_test = df_task_1[df_task_1["split"] == "test"]

print(f"Task 1 (official):")
print(f"  Total: {len(df_task_1):,}")
print(f"  Train: {len(df_task_1_train):,}")
print(f"  Test:  {len(df_task_1_test):,}")

# Filter and prepare data for Task 2 - exactly as specified
print("\n🎯 Filter and prepare data for Task 2")
df_task_2 = df_examples_products[df_examples_products["large_version"] == 1]
df_task_2_train = df_task_2[df_task_2["split"] == "train"]
df_task_2_test = df_task_2[df_task_2["split"] == "test"]

print(f"Task 2 (official):")
print(f"  Total: {len(df_task_2):,}")
print(f"  Train: {len(df_task_2_train):,}")
print(f"  Test:  {len(df_task_2_test):,}")

# Filter and prepare data for Task 3 - exactly as specified
print("\n🎯 Filter and prepare data for Task 3")
# .copy() avoids a SettingWithCopyWarning when the new column is added below
df_task_3 = df_examples_products[df_examples_products["large_version"] == 1].copy()
df_task_3["substitute_label"] = df_task_3["esci_label"].apply(lambda esci_label: 1 if esci_label == "S" else 0)
# Note: keeping esci_label column (not deleting as in original spec)
df_task_3_train = df_task_3[df_task_3["split"] == "train"]
df_task_3_test = df_task_3[df_task_3["split"] == "test"]

print(f"Task 3 (official):")
print(f"  Total: {len(df_task_3):,}")
print(f"  Train: {len(df_task_3_train):,}")
print(f"  Test:  {len(df_task_3_test):,}")

# Optional: Merge queries with sources (as specified)
print("\n🔗 Merge queries with sources (optional)")
df_examples_products_source = pd.merge( 
    df_examples_products, 
    df_sources, 
    how='left', 
    left_on=['query_id'],
    right_on=['query_id']
)

print(f"With sources shape: {df_examples_products_source.shape}")

print(f"\n✅ KDD Cup official task preparation complete!")
print(f"   📊 df_task_1: {len(df_task_1):,} examples")
print(f"   🎯 df_task_2: {len(df_task_2):,} examples")
print(f"   🔍 df_task_3: {len(df_task_3):,} examples")
print(f"   📋 df_examples_products_source: {len(df_examples_products_source):,} examples")

## 7.5. LightGBM Model Training & Evaluation

In [None]:
print_section("LIGHTGBM MODEL TRAINING")

# Import LightGBM and additional libraries
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import ndcg_score, accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
import time

print("📦 LightGBM and metric libraries loaded!")

# Prepare feature-enhanced data for Task 1 (using English subset with features)
print("\n🎯 Preparing feature-enhanced dataset for Task 1...")
df_task1_features = df_english[df_english['small_version'] == 1].copy()

print(f"Task 1 dataset shape: {df_task1_features.shape}")
print(f"Feature columns available: {len([col for col in df_task1_features.columns if col.endswith(('_len', '_count', '_words', '_overlap', '_jaccard')) or '_in_' in col or col.startswith('has_')])}")

# Split train/test
train_data = df_task1_features[df_task1_features['split'] == 'train'].copy()
test_data = df_task1_features[df_task1_features['split'] == 'test'].copy()

print(f"Train size: {len(train_data):,}")
print(f"Test size: {len(test_data):,}")
print(f"Train queries: {train_data['query_id'].nunique():,}")
print(f"Test queries: {test_data['query_id'].nunique():,}")

In [None]:
# Feature preparation
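print("\n⚙️ Preparing features...")

# Select the feature columns that we created ('_in_' and 'has_' are not
# suffixes, so they are matched separately from endswith())
feature_cols = [
    col for col in train_data.columns
    if col.endswith(('_len', '_count', '_words', '_overlap', '_jaccard'))
    or '_in_' in col or col.startswith('has_')
]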
print(f"Selected features: {len(feature_cols)}")
for i, col in enumerate(feature_cols):
    if i % 3 == 0:
        print()
    print(f"  {col:25}", end="")
print()

# Prepare features and target
X_train = train_data[feature_cols].fillna(0)
y_train = train_data['esci_score']
X_test = test_data[feature_cols].fillna(0)
y_test = test_data['esci_score']

print(f"\nFeatures shape: {X_train.shape}")
print(f"Target distribution in train:")
print(train_data['esci_label'].value_counts().sort_index())

# Prepare ranking data (group by query_id for LightGBM ranker)
print("\n📊 Preparing ranking data...")
train_sorted = train_data.sort_values('query_id').reset_index(drop=True)
test_sorted = test_data.sort_values('query_id').reset_index(drop=True)

# Get features for sorted data
X_train_sorted = train_sorted[feature_cols].fillna(0)
y_train_sorted = train_sorted['esci_score']
X_test_sorted = test_sorted[feature_cols].fillna(0)
y_test_sorted = test_sorted['esci_score']

# Create group information (number of items per query)
train_groups = train_sorted.groupby('query_id').size().values
test_groups = test_sorted.groupby('query_id').size().values

print(f"Train groups: {len(train_groups)} queries")
print(f"Test groups: {len(test_groups)} queries")
print(f"Group sizes - Train: min={min(train_groups)}, max={max(train_groups)}, mean={np.mean(train_groups):.1f}")
print(f"Group sizes - Test: min={min(test_groups)}, max={max(test_groups)}, mean={np.mean(test_groups):.1f}")
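
The group sizes computed above only describe the data correctly if the rows of each query are contiguous and appear in the same order as the sizes. The toy sketch below (made-up `query_id` values) shows one way to build the sizes and sanity-check that alignment before handing them to LightGBM.

```python
import pandas as pd

# Toy data: rows for the same query are deliberately interleaved.
df = pd.DataFrame({
    "query_id": [7, 3, 7, 3, 3, 9],
    "esci_score": [3, 0, 2, 1, 3, 0],
})

# Sort so that each query forms one contiguous block of rows.
df_sorted = df.sort_values("query_id", kind="stable").reset_index(drop=True)

# sort=False keeps groups in order of first appearance, which matches the row order.
group_sizes = df_sorted.groupby("query_id", sort=False).size().to_numpy()

# Sanity checks: sizes cover every row, and each query is exactly one block.
n_blocks = int((df_sorted["query_id"] != df_sorted["query_id"].shift()).sum())
assert group_sizes.sum() == len(df_sorted)
assert n_blocks == len(group_sizes)

print(group_sizes)
```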

In [None]:
# Train LightGBM Ranker
print("\n🚀 Starting LightGBM Ranker training...")

# Create train/validation split (query-level split to avoid leakage):
# 80% of the queries go to training, 20% to validation
unique_train_queries = train_sorted['query_id'].unique()
train_queries_final, val_queries = train_test_split(unique_train_queries, test_size=0.2, random_state=42)

# Create validation mask
val_mask = train_sorted['query_id'].isin(val_queries)
train_final_mask = train_sorted['query_id'].isin(train_queries_final)

# Split data
X_train_final = X_train_sorted[train_final_mask]
y_train_final = y_train_sorted[train_final_mask]
train_groups_final = train_sorted[train_final_mask].groupby('query_id').size().values

X_val = X_train_sorted[val_mask]
y_val = y_train_sorted[val_mask]
val_groups = train_sorted[val_mask].groupby('query_id').size().values

print(f"Final train: {len(X_train_final)} samples, {len(train_groups_final)} queries")
print(f"Validation: {len(X_val)} samples, {len(val_groups)} queries")

# LightGBM parameters for ranking
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'eval_at': [1, 5, 10],  # evaluate NDCG@1/5/10 so that best_score contains 'ndcg@10'
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
    'random_state': 42
}

print("📋 LightGBM parameters:")
for key, value in params.items():
    print(f"  {key}: {value}")

# Create datasets
print("\n📊 Creating LightGBM datasets...")
start_time = time.time()

train_dataset = lgb.Dataset(
    X_train_final, 
    label=y_train_final, 
    group=train_groups_final
)

val_dataset = lgb.Dataset(
    X_val, 
    label=y_val, 
    group=val_groups,
    reference=train_dataset
)

# Train model
print("\n🎯 Starting model training...")
model = lgb.train(
    params,
    train_dataset,
    valid_sets=[val_dataset],
    num_boost_round=500,
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=25)
    ]
)

training_time = time.time() - start_time
print(f"\n⏱️ Training time: {training_time:.2f} seconds")
print(f"🏆 Best iteration: {model.best_iteration}")
print(f"📊 Best score: {model.best_score['valid_0']['ndcg@10']:.4f}")
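
The validation split above is made at the query level, so all products of a query land on the same side. The dependency-free sketch below illustrates that idea and checks that no query is shared between the two sides. `split_by_query` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
import random


def split_by_query(query_ids, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) row indices with no query on both sides."""
    unique_queries = sorted(set(query_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_queries)

    n_val = max(1, round(len(unique_queries) * val_fraction))
    val_queries = set(unique_queries[:n_val])

    train_idx = [i for i, q in enumerate(query_ids) if q not in val_queries]
    val_idx = [i for i, q in enumerate(query_ids) if q in val_queries]
    return train_idx, val_idx


# Toy data: 10 queries with 3 candidate products each.
query_ids = [q for q in range(10) for _ in range(3)]
train_idx, val_idx = split_by_query(query_ids)

train_q = {query_ids[i] for i in train_idx}
val_q = {query_ids[i] for i in val_idx}
assert train_q.isdisjoint(val_q)
print(f"train rows: {len(train_idx)}, validation rows: {len(val_idx)}")
```

Splitting rows at random instead would let the model see some products of a validation query during training, which tends to make the validation NDCG look better than it should.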

In [None]:
# Model Evaluation
print_section("MODEL EVALUATION")

# Predict on test data
print("📊 Predicting on the test set...")
y_pred_test = model.predict(X_test_sorted, num_iteration=model.best_iteration)

# Calculate NDCG scores for different k values
def calculate_ndcg_by_query(y_true, y_pred, test_groups, k_values=(1, 5, 10)):
    """Calculate NDCG@k for each query and return average"""
    ndcg_scores = {f'ndcg@{k}': [] for k in k_values}
    
    start_idx = 0
    for group_size in test_groups:
        end_idx = start_idx + group_size
        
        # Get true and predicted relevance for this query (positional slice)
        y_true_query = y_true.iloc[start_idx:end_idx].values
        y_pred_query = y_pred[start_idx:end_idx]
        
        # Calculate NDCG@k for this query. Queries with fewer than k documents
        # are skipped, and so are single-document queries, which recent
        # scikit-learn versions reject in ndcg_score
        for k in k_values:
            if len(y_true_query) >= max(k, 2):
                ndcg_k = ndcg_score([y_true_query], [y_pred_query], k=k)
                ndcg_scores[f'ndcg@{k}'].append(ndcg_k)
        
        start_idx = end_idx
    
    # Calculate average NDCG@k
    avg_ndcg = {}
    for metric, scores in ndcg_scores.items():
        avg_ndcg[metric] = np.mean(scores) if scores else 0.0
    
    return avg_ndcg

# Calculate test performance
test_ndcg = calculate_ndcg_by_query(y_test_sorted, y_pred_test, test_groups)

print("📈 Test Set Performance:")
print("-" * 30)
for metric, score in test_ndcg.items():
    print(f"  {metric.upper()}: {score:.4f}")

# Feature importance analysis
print("\n🎯 Feature Importance Analysis:")
print("-" * 40)
feature_importance = model.feature_importance(importance_type='gain')
feature_names = feature_cols

# Create feature importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

print("Top 10 Most Important Features:")
for i, (_, row) in enumerate(importance_df.head(10).iterrows()):
    print(f"  {i+1:2d}. {row['feature']:25} {row['importance']:8.0f}")

# Performance by ESCI label
print("\n📊 Performance by ESCI Label:")
print("-" * 35)
test_sorted_with_pred = test_sorted.copy()
test_sorted_with_pred['pred_score'] = y_pred_test

esci_performance = {}
for label in ['E', 'S', 'C', 'I']:
    label_mask = test_sorted_with_pred['esci_label'] == label
    if label_mask.sum() > 0:
        label_data = test_sorted_with_pred[label_mask]
        avg_true_score = label_data['esci_score'].mean()
        avg_pred_score = label_data['pred_score'].mean()
        count = len(label_data)
        esci_performance[label] = {
            'count': count,
            'avg_true': avg_true_score,
            'avg_pred': avg_pred_score
        }
        print(f"  {label} ({count:,} samples): True={avg_true_score:.3f}, Pred={avg_pred_score:.3f}")

# Query-level analysis
print("\n🔍 Query-level Analysis:")
print("-" * 25)
query_stats = test_sorted.groupby('query_id').agg({
    'esci_score': ['count', 'mean', 'std'],
    'query': 'first'
}).round(3)

query_stats.columns = ['query_size', 'avg_relevance', 'relevance_std', 'query_text']
print(f"Average query size: {query_stats['query_size'].mean():.1f}")
print(f"Query size range: {query_stats['query_size'].min()}-{query_stats['query_size'].max()}")
print(f"Average relevance: {query_stats['avg_relevance'].mean():.3f}")

# Best and worst performing queries
print("\nTop 5 queries by average relevance:")
top_queries = query_stats.nlargest(5, 'avg_relevance')
for i, (query_id, row) in enumerate(top_queries.iterrows()):
    print(f"  {i+1}. '{row['query_text'][:50]}...' (size: {row['query_size']}, avg: {row['avg_relevance']:.3f})")

print(f"\n✅ Model evaluation complete!")
print(f"🎯 Final NDCG@10: {test_ndcg['ndcg@10']:.4f}")
print(f"⏱️ Training time: {training_time:.2f} seconds")
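
For readers who want to see what the NDCG@k numbers above measure, here is a minimal pure-Python sketch for a single query. It uses the relevance grade directly as the gain with a log2 position discount, which is my understanding of the default behaviour of scikit-learn's `ndcg_score`. Treat it as an illustration rather than a drop-in replacement. The function names and toy values are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of an already ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(y_true, y_score, k):
    """NDCG@k for one query: DCG of the predicted ranking divided by the ideal DCG."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


y_true = [3, 0, 2, 1]                # relevance grades for four products
good_scores = [0.9, 0.1, 0.7, 0.3]   # ranks products in the ideal order
bad_scores = [0.1, 0.9, 0.3, 0.7]    # ranks products in the reverse order

print(f"good ranking: {ndcg_at_k(y_true, good_scores, k=4):.4f}")
print(f"bad ranking:  {ndcg_at_k(y_true, bad_scores, k=4):.4f}")
```

Because the score is normalized per query, averaging it over queries (as `calculate_ndcg_by_query` does) gives each query equal weight regardless of how many products it has.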

## 8. Data Quality Checks

In [None]:
print_section("DATA QUALITY CHECKS")

def perform_quality_checks(task_name, task_data):
    """Perform comprehensive quality checks on task data"""
    print(f"\n🔍 {task_name} Quality Checks")
    print("-" * 40)
    
    train_df = task_data['train']
    test_df = task_data['test']
    target_col = task_data['target_col']
    
    # Basic checks
    print(f"✅ Train shape: {train_df.shape}")
    print(f"✅ Test shape: {test_df.shape}")
    
    # Missing values check
    train_missing = train_df.isnull().sum().sum()
    test_missing = test_df.isnull().sum().sum()
    print(f"✅ Train missing values: {train_missing}")
    print(f"✅ Test missing values: {test_missing}")
    
    # Target distribution check
    if target_col in train_df.columns:
        target_dist = train_df[target_col].value_counts().sort_index()
        print(f"✅ Target distribution:")
        for value, count in target_dist.items():
            print(f"   {value}: {count:,} ({count/len(train_df)*100:.1f}%)")
    
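    # Feature completeness
    # Note: 'has_' is a prefix and '_in_' an infix, so endswith() alone would miss them.
    feature_cols = [
        col for col in train_df.columns
        if col.endswith(('_len', '_count', '_words', '_overlap', '_jaccard'))
        or '_in_' in col
        or col.startswith('has_')
    ]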
    feature_missing = train_df[feature_cols].isnull().sum().sum()
    print(f"✅ Feature columns: {len(feature_cols)}")
    print(f"✅ Feature missing values: {feature_missing}")
    
    # Query overlap check (no query should be in both train and test)
    train_queries = set(train_df['query_id'].unique())
    test_queries = set(test_df['query_id'].unique())
    query_overlap = len(train_queries.intersection(test_queries))
    
    if query_overlap == 0:
        print(f"✅ No query overlap between train/test")
    else:
        print(f"⚠️  Query overlap detected: {query_overlap} queries")
    
    print(f"✅ Train unique queries: {len(train_queries):,}")
    print(f"✅ Test unique queries: {len(test_queries):,}")
    
    return {
        'train_shape': train_df.shape,
        'test_shape': test_df.shape,
        'train_missing': train_missing,
        'test_missing': test_missing,
        'feature_count': len(feature_cols),
        'query_overlap': query_overlap
    }

# Perform quality checks for all tasks
quality_results = {}

for task_id, task_data in task_datasets.items():
    task_name = f"Task {task_id[-1]}: {task_data['description']}"
    quality_results[task_id] = perform_quality_checks(task_name, task_data)

# Summary of all checks
print(f"\n\n📋 QUALITY CHECK SUMMARY")
print("=" * 50)

all_checks_passed = True
for task_id, results in quality_results.items():
    task_num = task_id[-1]
    # A task passes when train/test queries are disjoint and the train set has no missing values
    passed = results['query_overlap'] == 0 and results['train_missing'] == 0
    status = "✅ PASSED" if passed else "⚠️  WARNING"
    
    if not passed:
        all_checks_passed = False
    
    print(f"Task {task_num}: {status}")
    print(f"  Train: {results['train_shape'][0]:,} rows, {results['train_shape'][1]} cols")
    print(f"  Test:  {results['test_shape'][0]:,} rows, {results['test_shape'][1]} cols")
    print(f"  Features: {results['feature_count']}")
    print(f"  Missing: {results['train_missing']} (train), {results['test_missing']} (test)")
    print(f"  Query overlap: {results['query_overlap']}")
    print()

if all_checks_passed:
    print("🎉 All quality checks passed! Data is ready for training.")
else:
    print("⚠️  Some quality issues detected. Please review before training.")

# Create feature list for reference
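# 'has_' is a prefix and '_in_' an infix, so they need startswith() and a substring test
feature_columns = [
    col for col in df_task1.columns
    if col.endswith(('_len', '_count', '_words', '_overlap', '_jaccard'))
    or '_in_' in col
    or col.startswith('has_')
]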
print(f"\n📊 Available Features ({len(feature_columns)}):")
for i, col in enumerate(sorted(feature_columns)):
    if i % 2 == 0:
        print()
    print(f"  {col:35}", end="")
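

Feature columns in this notebook are identified purely by naming pattern, so the matching rule matters: `str.endswith` only sees trailing text and cannot match a prefix such as `has_` or an infix such as `_in_`. The sketch below shows one way to combine the three tests in a single helper. The helper name `select_feature_columns` and the sample column names are hypothetical and not taken from the notebook.

```python
SUFFIXES = ('_len', '_count', '_words', '_overlap', '_jaccard')

def select_feature_columns(columns):
    """Return the column names that follow the feature naming patterns."""
    return [
        col for col in columns
        if col.endswith(SUFFIXES) or '_in_' in col or col.startswith('has_')
    ]

# Hypothetical column names, for illustration only.
columns = ['query_id', 'query_len', 'title_word_count',
           'brand_in_query', 'has_brand', 'esci_score']
print(select_feature_columns(columns))
```

Keeping the rule in one function avoids the quality checks and the saved metadata drifting apart if a new naming pattern is added later.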

## 9. Save Preprocessed Data

In [None]:
print_section("SAVING PROCESSED DATA")

# Create output directories
output_dirs = {
    'task1': Config.PROCESSED_DATA_DIR / "task_1",
    'task2': Config.PROCESSED_DATA_DIR / "task_2", 
    'task3': Config.PROCESSED_DATA_DIR / "task_3"
}

for task_dir in output_dirs.values():
    task_dir.mkdir(parents=True, exist_ok=True)

# Save datasets
print("💾 Saving processed datasets...")

saved_files = []

for task_id, task_data in task_datasets.items():
    task_dir = output_dirs[task_id]
    task_num = task_id[-1]
    
    # Save train and test sets
    train_file = task_dir / "train.parquet"
    test_file = task_dir / "test.parquet"
    full_file = task_dir / "full.parquet"
    
    task_data['train'].to_parquet(train_file, index=False)
    task_data['test'].to_parquet(test_file, index=False) 
    task_data['full'].to_parquet(full_file, index=False)
    
    saved_files.extend([train_file, test_file, full_file])
    
    print(f"✅ Task {task_num} saved:")
    print(f"   📁 {train_file.relative_to(Config.BASE_DIR)}")
    print(f"   📁 {test_file.relative_to(Config.BASE_DIR)}")
    print(f"   📁 {full_file.relative_to(Config.BASE_DIR)}")

# Save feature metadata
feature_metadata = {
    'feature_columns': feature_columns,
    'esci_mapping': Config.ESCI_MAPPING,
    'text_fields': ['query', 'product_title', 'product_description', 'product_bullet_point', 'product_brand'],
    'created_features': {
        'basic_text': [col for col in feature_columns if col.endswith(('_len', '_count', '_words'))],
        'similarity': [col for col in feature_columns if col.endswith(('_overlap', '_jaccard')) or '_in_' in col],
        'categorical': [col for col in feature_columns if col.startswith('has_')]
    },
    'preprocessing_info': {
        'total_examples': len(df_examples),
        'english_examples': len(df_english),
        'feature_count': len(feature_columns),
        'quality_passed': all_checks_passed
    }
}

# Save metadata
metadata_file = Config.PROCESSED_DATA_DIR / "feature_metadata.pkl"
with open(metadata_file, 'wb') as f:
    pickle.dump(feature_metadata, f)

saved_files.append(metadata_file)
print(f"✅ Feature metadata saved: {metadata_file.relative_to(Config.BASE_DIR)}")

# Create summary CSV
summary_data = []
for task_id, task_data in task_datasets.items():
    task_num = task_id[-1]
    train_df = task_data['train']
    test_df = task_data['test']
    
    summary_data.append({
        'task': f"Task {task_num}",
        'description': task_data['description'],
        'train_samples': len(train_df),
        'test_samples': len(test_df),
        'total_samples': len(task_data['full']),
        'unique_queries': task_data['full']['query_id'].nunique(),
        'target_column': task_data['target_col'],
        'feature_count': len(feature_columns)
    })

summary_df = pd.DataFrame(summary_data)
summary_file = Config.PROCESSED_DATA_DIR / "dataset_summary.csv"
summary_df.to_csv(summary_file, index=False)
saved_files.append(summary_file)

print(f"✅ Dataset summary saved: {summary_file.relative_to(Config.BASE_DIR)}")

# Display summary
print(f"\n📊 DATASET SUMMARY")
print("=" * 50)
display(summary_df)

print(f"\n🎉 PREPROCESSING COMPLETED!")
print("=" * 50)
print(f"✅ Processed {len(df_examples):,} original examples")
print(f"✅ Created {len(feature_columns)} features") 
print(f"✅ Generated {len(task_datasets)} task-specific datasets")
print(f"✅ Saved {len(saved_files)} files")
print(f"✅ Quality checks: {'PASSED' if all_checks_passed else 'WARNINGS'}")

print(f"\n📁 Output Directory: {Config.PROCESSED_DATA_DIR}")
print(f"📁 Files saved:")
for file_path in saved_files:
    file_size = file_path.stat().st_size / (1024*1024)  # MB
    print(f"   {str(file_path.relative_to(Config.BASE_DIR)):50} ({file_size:.1f} MB)")

print(f"\n🚀 Ready for training! Use the following files:")
print(f"   Task 1 (Ranking): data/processed/task_1/")
print(f"   Task 2 (Classification): data/processed/task_2/")
print(f"   Task 3 (Substitute): data/processed/task_3/")
print(f"\n💡 Next step: python main.py --task 1")
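

A downstream training script will typically need the saved feature list before it touches the data. The following is a minimal sketch of reading `feature_metadata.pkl` back. The helper name `load_feature_metadata` is hypothetical, and a temporary file with made-up contents stands in for the real one so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

def load_feature_metadata(path):
    """Load the pickled metadata dictionary. Only unpickle files you created yourself."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for data/processed/feature_metadata.pkl with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    metadata_file = Path(tmp) / "feature_metadata.pkl"
    with open(metadata_file, 'wb') as f:
        pickle.dump({'feature_columns': ['query_len', 'has_brand']}, f)

    metadata = load_feature_metadata(metadata_file)
    feature_columns = metadata['feature_columns']
    print(f"Loaded {len(feature_columns)} feature columns")
```

The per-task Parquet files can be read in the same spirit with `pd.read_parquet`, selecting `feature_columns` plus the task's target column.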