# Smart Product Pricing Challenge: ML Pipeline

This notebook implements an end-to-end machine learning pipeline that predicts product prices for the Smart Product Pricing Challenge. The goal is a reproducible pipeline that achieves a SMAPE (Symmetric Mean Absolute Percentage Error) below 10%.

## Table of Contents
1. [Setup & Environment Configuration](#setup)
2. [Data Loading & Exploration](#data-loading)
3. [Text Processing & Feature Engineering](#text-processing)
4. [Image Processing & Feature Engineering](#image-processing)
5. [Model Training & Cross-Validation](#model-training)
6. [Ensemble & Stacking](#ensemble)
7. [Prediction & Submission Generation](#prediction)
8. [Quick Baseline (If You're in a Hurry)](#quick-baseline)
9. [Utility Functions](#utility-functions)

**Important Ethical Note:** This pipeline does NOT use external price lookups or web-scraped price data. Only the provided dataset and files are used for model training and prediction. Using external price data would violate the competition rules and is grounds for disqualification.
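
The SMAPE target stated in the introduction can be made concrete with a short sketch. The competition's exact formula is not given in this notebook, so the block assumes the common definition: the mean of |y - ŷ| / ((|y| + |ŷ|) / 2), expressed in percent. The function name `smape` and the `eps` guard are illustrative choices, not part of the provided files.

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-9):
    """SMAPE in percent, assuming the common (|y| + |y_hat|) / 2 denominator."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # eps avoids division by zero when both values are exactly zero
    ratio = np.abs(y_true - y_pred) / np.maximum(denom, eps)
    return float(np.mean(ratio) * 100.0)

# Made-up prices and predictions, for illustration only
score = smape([100.0, 200.0], [110.0, 180.0])
print(f"SMAPE: {score:.2f}%")
```

Because the denominator averages the true and predicted values, the same absolute error costs more on cheap items than on expensive ones.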

## Setup & Environment Configuration

In [None]:
# Import necessary libraries
import os
import sys
import json
import pickle
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import re
import gc
import logging
from pathlib import Path
from functools import partial
import multiprocessing
from collections import defaultdict, Counter
import time
import joblib
from datetime import datetime

# Machine Learning
import sklearn
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import lightgbm as lgb
import scipy.sparse  # Needed by the feature-caching code (save_npz / load_npz)

# Configure warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
import random
random.seed(RANDOM_SEED)

# Try importing torch and set its seed if available
try:
    import torch
    torch.manual_seed(RANDOM_SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(RANDOM_SEED)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False
    print("PyTorch is not installed. Will use CPU-only models.")

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

In [None]:
# Auto-detect if running on Kaggle or locally
if os.path.exists('/kaggle/input'):
    # Kaggle environment
    BASE_PATH = '/kaggle/input'
    OUTPUT_PATH = '/kaggle/working'
    print("Running in Kaggle environment")
else:
    # Local environment - adjust these paths based on your local setup
    BASE_PATH = os.path.join(os.path.dirname(os.getcwd()), 'dataset')
    OUTPUT_PATH = os.getcwd()
    print(f"Running in local environment: {os.getcwd()}")
    
    # Handle specific local paths if needed
    if not os.path.exists(BASE_PATH):
        possible_paths = [
            './dataset',
            '../dataset',
            './student_resource/dataset',
            '../student_resource/dataset'
        ]
        for path in possible_paths:
            if os.path.exists(path):
                BASE_PATH = path
                print(f"Found dataset at: {BASE_PATH}")
                break
        
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)

# Add utils.py directory to path if needed
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))
try:
    from utils import download_images
    print("Successfully imported utils.py")
except ImportError:
    sys.path.append('./src')
    sys.path.append('../src')
    sys.path.append('./student_resource/src')
    sys.path.append('../student_resource/src')
    try:
        from utils import download_images
        print("Successfully imported utils.py from alternate path")
    except ImportError:
        print("Warning: Could not import utils.py - image processing may not be available")
        
        # Define a fallback download_images function
        def download_image(image_link, savefolder):
            if(isinstance(image_link, str)):
                filename = Path(image_link).name
                image_save_path = os.path.join(savefolder, filename)
                if(not os.path.exists(image_save_path)):
                    try:
                        import urllib.request
                        urllib.request.urlretrieve(image_link, image_save_path)    
                    except Exception as ex:
                        print('Warning: Not able to download - {}\n{}'.format(image_link, ex))
                else:
                    return
            return

        def download_images(image_links, download_folder):
            if not os.path.exists(download_folder):
                os.makedirs(download_folder)
            results = []
            download_image_partial = partial(download_image, savefolder=download_folder)
            with multiprocessing.Pool(min(100, multiprocessing.cpu_count())) as pool:
                for result in tqdm(pool.imap(download_image_partial, image_links), total=len(image_links)):
                    results.append(result)
                pool.close()
                pool.join()
            print(f"Downloaded images to {download_folder}")

# Set and validate file paths
TRAIN_PATH = os.path.join(BASE_PATH, 'train.csv')
TEST_PATH = os.path.join(BASE_PATH, 'test.csv')
SAMPLE_TEST_PATH = os.path.join(BASE_PATH, 'sample_test.csv')
SAMPLE_TEST_OUT_PATH = os.path.join(BASE_PATH, 'sample_test_out.csv')
OUTPUT_CSV_PATH = os.path.join(OUTPUT_PATH, 'test_out.csv')
METRICS_PATH = os.path.join(OUTPUT_PATH, 'oof_metrics.json')
CACHE_DIR = os.path.join(OUTPUT_PATH, 'cache')

# Create cache directory for storing model artifacts and intermediate features
os.makedirs(CACHE_DIR, exist_ok=True)

# Set environment variables for HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_INFERENCE_ENDPOINT"] = ""  # Empty to avoid using HF inference endpoint

# Print environment info
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"LightGBM version: {lgb.__version__}")
print(f"PyTorch available: {TORCH_AVAILABLE}")
if TORCH_AVAILABLE:
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU: {torch.cuda.get_device_name(0)}")

# Set thread limits to avoid oversubscribing shared CPUs.
# Note: numerical libraries typically read these variables when they are first
# loaded, so they mainly affect libraries imported after this point.
for _thread_var in (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[_thread_var] = "4"

print(f"Setup complete. Data path: {BASE_PATH}, Output path: {OUTPUT_PATH}")
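
The path search in the setup cell only checks that a candidate directory exists. A slightly stricter variant, sketched here under the assumption that a usable dataset folder must contain `train.csv`, checks for the file itself. The helper name `find_dataset_dir` and the temporary directory are hypothetical and exist only for this demonstration.

```python
import tempfile
from pathlib import Path

def find_dataset_dir(candidates, required_file="train.csv"):
    """Return the first candidate directory containing required_file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if (path / required_file).is_file():
            return path
    return None

# Demonstration with a temporary directory standing in for a real dataset folder
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "dataset"
    dataset_dir.mkdir()
    (dataset_dir / "train.csv").write_text("sample_id,price\n1,9.99\n")
    found = find_dataset_dir([Path(tmp) / "missing", dataset_dir])
    found_name = found.name if found is not None else None
    print(found_name)
```

Returning `None` instead of a default path lets the caller decide whether to fall back to the sample files or stop.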

## Data Loading & Exploration

In [None]:
# Load the data
try:
    train = pd.read_csv(TRAIN_PATH)
    test = pd.read_csv(TEST_PATH)
    print(f"Train data shape: {train.shape}")
    print(f"Test data shape: {test.shape}")
except FileNotFoundError as e:
    print(f"Error loading data: {e}")
    print("Trying to load sample test data instead...")
    try:
        train = pd.read_csv(SAMPLE_TEST_PATH)
        sample_test_out = pd.read_csv(SAMPLE_TEST_OUT_PATH)
        train = pd.merge(train, sample_test_out, on='sample_id', how='inner')
        test = train.copy()  # For demonstration purposes only
        print(f"Sample test data shape: {train.shape}")
    except FileNotFoundError as e2:
        print(f"Error loading sample data: {e2}")
        print("Please ensure that the dataset files are in the correct location.")
        raise  # Stop here: the following cells cannot run without train/test data

# Check for missing values
print("\nMissing values in train data:")
print(train.isnull().sum())

print("\nMissing values in test data:")
print(test.isnull().sum())

# Display a few sample rows from the train data
print("\nSample rows from train data:")
display(train.head(3))

In [None]:
# Analyze price distribution (for train data only)
if 'price' in train.columns:
    plt.figure(figsize=(12, 6))
    
    # Original price distribution
    plt.subplot(1, 2, 1)
    sns.histplot(train['price'], bins=50, kde=True)
    plt.title('Original Price Distribution')
    plt.xlabel('Price')
    plt.ylabel('Frequency')
    plt.xscale('log')
    
    # Log-transformed price distribution
    plt.subplot(1, 2, 2)
    sns.histplot(np.log1p(train['price']), bins=50, kde=True)
    plt.title('Log-transformed Price Distribution')
    plt.xlabel('Log(Price + 1)')
    plt.ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
    
    # Calculate statistics for the price
    price_stats = train['price'].describe()
    print("Price statistics:")
    print(price_stats)
    
    # Identify outliers in price
    Q1 = train['price'].quantile(0.25)
    Q3 = train['price'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = train[(train['price'] < lower_bound) | (train['price'] > upper_bound)]
    print(f"\nNumber of price outliers: {len(outliers)} ({len(outliers) / len(train) * 100:.2f}%)")
    
    # Calculate and display percentiles
    percentiles = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999]
    price_percentiles = np.percentile(train['price'], [p * 100 for p in percentiles])
    
    percentile_df = pd.DataFrame({
        'Percentile': [f"{p * 100:g}%" for p in percentiles],  # :g drops trailing ".0"
        'Price': price_percentiles
    })
    print("\nPrice percentiles:")
    print(percentile_df)
else:
    print("Price column not available in the data.")
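
The plots above compare raw prices with `log1p(price)`. If a model is later trained on the log scale (an assumption here, since the training cells come later in the notebook), predictions have to be mapped back with `expm1`. A minimal round-trip sketch with made-up prices:

```python
import numpy as np

prices = np.array([0.5, 9.99, 150.0, 2500.0])  # made-up prices for illustration

log_prices = np.log1p(prices)     # log(1 + price), defined even for price = 0
recovered = np.expm1(log_prices)  # inverse of log1p

print(np.round(log_prices, 3))
print(np.allclose(recovered, prices))
```

The transform compresses the long right tail, so a few very expensive items dominate the loss less than they would on the raw scale.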

In [None]:
# Examine the catalog_content field structure
print("\nSample catalog_content from first entry:")
if 'catalog_content' in train.columns:
    print(train['catalog_content'].iloc[0][:500] + '...')
    
    # Extract the average length of catalog content
    content_lens = train['catalog_content'].str.len()
    print(f"\nAverage catalog_content length: {content_lens.mean():.2f} characters")
    print(f"Min catalog_content length: {content_lens.min()} characters")
    print(f"Max catalog_content length: {content_lens.max()} characters")
    
    # Check for patterns in the catalog_content
    content_samples = train['catalog_content'].head(3)
    patterns = [
        "Item Name:", 
        "Bullet Point", 
        "Product Description:", 
        "Value:", 
        "Unit:"
    ]
    
    print("\nChecking for common patterns in catalog_content:")
    for pattern in patterns:
        match_count = content_samples.str.contains(pattern, regex=False, na=False).sum()
        print(f"Pattern '{pattern}' appears in {match_count} of {len(content_samples)} samples")
else:
    print("catalog_content column not available in the data.")

## Text Processing & Feature Engineering

In [None]:
# Define text processing functions
def clean_text(text):
    """Clean text by removing URLs, special characters, and converting to lowercase"""
    if not isinstance(text, str):
        return ""
        
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    
    # Remove excessive punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def extract_ipq(text):
    """Extract Item Pack Quantity (IPQ) from text"""
    if not isinstance(text, str):
        return 1
    
    text = text.lower()
    
    # Look for specific patterns indicating pack quantity
    patterns = [
        r'pack of (\d+)',
        r'(\d+)[-\s]pack',
        r'(\d+)\s*pcs',
        r'(\d+)\s*pieces',
        r'(\d+)\s*count',
        r'(\d+)\s*ct',
        r'(\d+)\s*pk',
        r'set of (\d+)',
        r'(\d+)\s*set',
        r'(\d+)\s*qty',
        r'quantity:\s*(\d+)',
        r'qty:\s*(\d+)',
        r'count:\s*(\d+)',
        r'value:\s*(\d+)',
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            try:
                quantity = int(match.group(1))
                return max(1, min(quantity, 100))  # Cap at reasonable values
            except ValueError:
                pass
    
    # Check for 'Value: X' pattern which often indicates quantity
    value_match = re.search(r'value:\s*([\d\.]+)', text)
    if value_match:
        try:
            value = float(value_match.group(1))
            if value >= 1 and value <= 100:
                return int(value)
        except ValueError:
            pass
            
    # Default to 1 if no pattern is found
    return 1

def extract_brand(text):
    """Extract brand name from text using heuristics"""
    if not isinstance(text, str):
        return "Unknown"
    
    # Look for common brand patterns
    brand_patterns = [
        r'brand:\s*([A-Za-z0-9][A-Za-z0-9\s&\-]+)',
        r'by\s+([A-Z][A-Za-z0-9\s&\-]+)',
        r'from\s+([A-Z][A-Za-z0-9\s&\-]+)',
        r'item name:\s*([A-Z][A-Za-z0-9\s&\-]+)'
    ]
    
    for pattern in brand_patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            brand = match.group(1).strip()
            # Limit length and filter out generic terms
            if len(brand) > 1 and len(brand) < 30:
                return brand
    
    # Try to extract first word from Item Name if it's uppercase
    item_name_match = re.search(r'item name:([^,\n]+)', text, re.IGNORECASE)
    if item_name_match:
        item_name = item_name_match.group(1).strip()
        first_word = item_name.split()[0] if item_name.split() else ""
        if first_word and first_word[0].isupper() and len(first_word) > 1:
            return first_word
    
    # Try the first word if it's all caps or first letter is capitalized
    words = text.split()
    if words and len(words[0]) > 1:
        if words[0].isupper() or (words[0][0].isupper() and not words[0].isupper()):
            return words[0]
    
    return "Unknown"

def extract_title(text):
    """Extract title from catalog content"""
    if not isinstance(text, str):
        return ""
    
    # Try to find item name pattern
    item_name_match = re.search(r'item name:(.*?)(?:bullet point|product description|$)', 
                               text, re.IGNORECASE | re.DOTALL)
    
    if item_name_match:
        title = item_name_match.group(1).strip()
        return title
    
    # If no specific pattern, take the first line or first 100 characters
    lines = text.split('\n')
    if lines:
        return lines[0].strip()
    
    return text[:100] if len(text) > 100 else text

def extract_description(text):
    """Extract product description from catalog content"""
    if not isinstance(text, str):
        return ""
    
    # Try to find product description pattern
    desc_match = re.search(r'product description:(.*?)(?:value:|unit:|$)', 
                           text, re.IGNORECASE | re.DOTALL)
    
    if desc_match:
        description = desc_match.group(1).strip()
        return description
    
    # If no specific pattern, take everything after the first line
    lines = text.split('\n')
    if len(lines) > 1:
        return ' '.join(lines[1:]).strip()
    
    return ""

def extract_bullet_points(text):
    """Extract bullet points from catalog content"""
    if not isinstance(text, str):
        return ""
    
    # Try to find bullet point pattern
    bullet_points = re.findall(r'bullet point \d+:(.*?)(?=bullet point \d+:|product description:|$)', 
                               text, re.IGNORECASE | re.DOTALL)
    
    if bullet_points:
        return ' '.join([bp.strip() for bp in bullet_points])
    
    return ""

def extract_basic_features(text):
    """Extract basic text features like length, word count, etc."""
    if not isinstance(text, str):
        text = ""
    
    features = {}
    
    # Text length
    features['text_len'] = len(text)
    
    # Number of words
    words = text.split()
    features['num_words'] = len(words)
    
    # Average word length
    if features['num_words'] > 0:
        features['avg_word_len'] = sum(len(word) for word in words) / features['num_words']
    else:
        features['avg_word_len'] = 0
    
    # Number of digits
    features['num_digits'] = sum(c.isdigit() for c in text)
    
    # Number of uppercase letters
    features['num_upper'] = sum(c.isupper() for c in text)
    
    # Number of lowercase letters
    features['num_lower'] = sum(c.islower() for c in text)
    
    # Ratio of uppercase to all letters
    total_letters = features['num_upper'] + features['num_lower']
    features['upper_ratio'] = features['num_upper'] / total_letters if total_letters > 0 else 0
    
    # Number of bullet points
    features['num_bullets'] = text.lower().count('bullet point')
    
    return features

print("Text processing functions defined successfully.")
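
To see how the regular expressions above behave, the sketch below applies the same title, bullet-point, and pack-quantity patterns to a single record. The sample string is invented; it only imitates the field labels ("Item Name:", "Bullet Point", "Product Description:", "Value:", "Unit:") that the exploration cell looks for, and real records may be laid out differently.

```python
import re

# Invented record that imitates the expected catalog_content layout
sample = (
    "Item Name: Acme Green Tea Bags, Pack of 6\n"
    "Bullet Point 1: Organic leaves\n"
    "Bullet Point 2: Resealable pouch\n"
    "Product Description: A mild everyday tea.\n"
    "Value: 6.0\n"
    "Unit: Count\n"
)

flags = re.IGNORECASE | re.DOTALL

# Title: everything after "Item Name:" up to the next known field label
title_match = re.search(
    r'item name:(.*?)(?:bullet point|product description|$)', sample, flags
)
title = title_match.group(1).strip()

# Bullets: each "Bullet Point N:" body, ending at the next bullet or the description
bullets = [
    b.strip()
    for b in re.findall(
        r'bullet point \d+:(.*?)(?=bullet point \d+:|product description:|$)',
        sample,
        flags,
    )
]

# Pack quantity: first "pack of N" occurrence in the lowercased text
pack = int(re.search(r'pack of (\d+)', sample.lower()).group(1))

print(title)
print(bullets)
print(pack)
```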

In [None]:
# Apply text processing to train and test data
def process_catalog_content(df):
    """
    Process catalog content and extract features
    Returns the dataframe with additional columns
    """
    if 'catalog_content' not in df.columns:
        print("Warning: catalog_content not found in dataframe")
        return df
    
    print("Processing catalog content...")
    
    # Create copies of the features to avoid modifying the original
    df_processed = df.copy()
    
    # Extract text components
    tqdm.pandas(desc="Extracting title")
    df_processed['title'] = df_processed['catalog_content'].progress_apply(extract_title)
    
    tqdm.pandas(desc="Extracting description")
    df_processed['description'] = df_processed['catalog_content'].progress_apply(extract_description)
    
    tqdm.pandas(desc="Extracting bullet points")
    df_processed['bullet_points'] = df_processed['catalog_content'].progress_apply(extract_bullet_points)
    
    # Clean text fields
    tqdm.pandas(desc="Cleaning title")
    df_processed['clean_title'] = df_processed['title'].progress_apply(clean_text)
    
    tqdm.pandas(desc="Cleaning description")
    df_processed['clean_description'] = df_processed['description'].progress_apply(clean_text)
    
    tqdm.pandas(desc="Cleaning bullet points")
    df_processed['clean_bullet_points'] = df_processed['bullet_points'].progress_apply(clean_text)
    
    # Combine all cleaned text for a single text feature
    df_processed['all_text'] = (df_processed['clean_title'] + ' ' + 
                             df_processed['clean_description'] + ' ' + 
                             df_processed['clean_bullet_points'])
    
    # Extract IPQ and brand
    tqdm.pandas(desc="Extracting IPQ")
    df_processed['ipq'] = df_processed['catalog_content'].progress_apply(extract_ipq)
    
    tqdm.pandas(desc="Extracting brand")
    df_processed['brand'] = df_processed['catalog_content'].progress_apply(extract_brand)
    
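    # Extract basic text features from the raw catalog content. 'all_text' is
    # lowercased and has most "Bullet Point" labels removed, so uppercase and
    # bullet counts computed on it would be (near) zero for every row.
    tqdm.pandas(desc="Extracting basic features")
    basic_features = df_processed['catalog_content'].progress_apply(extract_basic_features)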
    
    # Convert dictionary of features to columns
    for feature in ['text_len', 'num_words', 'avg_word_len', 'num_digits', 
                   'num_upper', 'num_lower', 'upper_ratio', 'num_bullets']:
        df_processed[feature] = basic_features.apply(lambda x: x.get(feature, 0))
    
    print("Catalog content processing completed")
    return df_processed

# Apply processing to train and test data
print("Processing train data...")
train_processed = process_catalog_content(train)

print("\nProcessing test data...")
test_processed = process_catalog_content(test)

# Print shape of processed data
print(f"Processed train data shape: {train_processed.shape}")
print(f"Processed test data shape: {test_processed.shape}")

# Display sample of processed data
print("\nSample of processed train data:")
display(train_processed[['title', 'description', 'ipq', 'brand', 'text_len', 'num_words']].head(3))

In [None]:
# Encode categorical features (brand)
def encode_categorical_features(train_df, test_df, categorical_cols=['brand']):
    """Encode categorical features using label encoding with Unknown handling"""
    encoders = {}
    train_df_encoded = train_df.copy()
    test_df_encoded = test_df.copy()
    
    for col in categorical_cols:
        if col in train_df.columns and col in test_df.columns:
            print(f"Encoding {col}...")
            
            # Initialize LabelEncoder
            encoder = LabelEncoder()
            
            # Get all unique values from both train and test
            all_values = pd.concat([
                train_df[col].fillna('Unknown'),
                test_df[col].fillna('Unknown')
            ]).unique()
            
            # Make sure 'Unknown' is in the values
            if 'Unknown' not in all_values:
                all_values = np.append(all_values, 'Unknown')
                
            # Fit encoder on all values
            encoder.fit(all_values)
            
            # Transform train and test data
            train_df_encoded[f'{col}_encoded'] = encoder.transform(train_df[col].fillna('Unknown'))
            test_df_encoded[f'{col}_encoded'] = encoder.transform(test_df[col].fillna('Unknown'))
            
            # Store encoder for later use
            encoders[col] = encoder
            
            # Calculate value counts for information
            val_counts = train_df[col].value_counts()
            print(f"Top 5 most common values for {col}: ")
            print(val_counts.head(5))
            print(f"Total unique values: {len(val_counts)}")
    
    return train_df_encoded, test_df_encoded, encoders

# Apply categorical encoding
train_encoded, test_encoded, encoders = encode_categorical_features(
    train_processed, test_processed, categorical_cols=['brand']
)

# Display sample of encoded data
print("\nSample of encoded train data:")
display(train_encoded[['brand', 'brand_encoded']].head(5))
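
The cell above fits the `LabelEncoder` on train and test values combined. An alternative, shown here as a sketch and not as the notebook's method, builds the vocabulary from training data only and sends unseen or missing brands to a reserved code. The brand names and the names `brand_to_code` and `encode` are made up for illustration.

```python
import pandas as pd

train_brands = pd.Series(["Acme", "Acme", "Zenith", None])  # made-up values
test_brands = pd.Series(["Zenith", "NewBrand", None])

# Build the vocabulary from training data only; code 0 is reserved for 'Unknown'
known = sorted(train_brands.dropna().unique())
brand_to_code = {brand: i + 1 for i, brand in enumerate(known)}

def encode(series):
    # Brands absent from the training vocabulary (and missing values) map to 0
    return series.map(brand_to_code).fillna(0).astype(int)

train_codes = encode(train_brands).tolist()
test_codes = encode(test_brands).tolist()
print(train_codes, test_codes)
```

With this scheme the encoder never depends on the test set, and the same mapping can be applied to data that arrives later.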

In [None]:
# Generate TF-IDF features and SVD reduction
def generate_tfidf_svd_features(train_df, test_df, text_col='all_text', 
                               cache_dir=CACHE_DIR, use_cache=True):
    """Generate TF-IDF features and apply SVD dimensionality reduction"""
    
    tfidf_cache_path = os.path.join(cache_dir, 'tfidf_vectorizer.pkl')
    svd_cache_path = os.path.join(cache_dir, 'tfidf_svd.pkl')
    train_tfidf_svd_cache_path = os.path.join(cache_dir, 'train_tfidf_svd_features.npz')
    test_tfidf_svd_cache_path = os.path.join(cache_dir, 'test_tfidf_svd_features.npz')
    
    # Check if cached files exist and use_cache is True
    if (use_cache and os.path.exists(tfidf_cache_path) and 
        os.path.exists(svd_cache_path) and 
        os.path.exists(train_tfidf_svd_cache_path) and 
        os.path.exists(test_tfidf_svd_cache_path)):
        
        print("Loading TF-IDF and SVD features from cache...")
        vectorizer = joblib.load(tfidf_cache_path)
        svd = joblib.load(svd_cache_path)
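        # The cache stores the dense SVD output as sparse matrices; convert back
        # to dense arrays so the DataFrame construction below works
        train_tfidf_svd = scipy.sparse.load_npz(train_tfidf_svd_cache_path).toarray()
        test_tfidf_svd = scipy.sparse.load_npz(test_tfidf_svd_cache_path).toarray()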
        
        print(f"Loaded cached TF-IDF SVD features with {svd.n_components} dimensions")
        
    else:
        print("Generating TF-IDF features...")
        
        # Configure TF-IDF vectorizer
        vectorizer = TfidfVectorizer(
            max_features=40000,  # Limit vocabulary size
            min_df=3,            # Minimum document frequency
            max_df=0.95,         # Maximum document frequency
            ngram_range=(1, 2),  # Unigrams and bigrams
            lowercase=True,
            strip_accents='unicode',
            analyzer='word',
            token_pattern=r'\w{1,}'  # Match words of at least length 1
        )
        
        # Fit and transform the training data
        train_text = train_df[text_col].fillna('').values
        test_text = test_df[text_col].fillna('').values
        
        print("Fitting TF-IDF vectorizer on training data...")
        train_tfidf = vectorizer.fit_transform(train_text)
        
        print("Transforming test data with TF-IDF vectorizer...")
        test_tfidf = vectorizer.transform(test_text)
        
        print(f"TF-IDF features shape - Train: {train_tfidf.shape}, Test: {test_tfidf.shape}")
        
        # Apply SVD for dimensionality reduction
        n_components = min(256, min(train_tfidf.shape[0], train_tfidf.shape[1]) - 1)
        print(f"Applying SVD to reduce dimensions to {n_components}...")
        
        svd = TruncatedSVD(n_components=n_components, random_state=RANDOM_SEED)
        train_tfidf_svd = svd.fit_transform(train_tfidf)
        test_tfidf_svd = svd.transform(test_tfidf)
        
        print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.4f}")
        
        # Cache the results
        print("Caching TF-IDF and SVD features...")
        joblib.dump(vectorizer, tfidf_cache_path)
        joblib.dump(svd, svd_cache_path)
        scipy.sparse.save_npz(train_tfidf_svd_cache_path, scipy.sparse.csr_matrix(train_tfidf_svd))
        scipy.sparse.save_npz(test_tfidf_svd_cache_path, scipy.sparse.csr_matrix(test_tfidf_svd))
    
    # Create feature names
    tfidf_svd_feature_names = [f'tfidf_svd_{i}' for i in range(train_tfidf_svd.shape[1])]
    
    # Convert to DataFrame
    train_tfidf_svd_df = pd.DataFrame(
        train_tfidf_svd, 
        columns=tfidf_svd_feature_names,
        index=train_df.index
    )
    
    test_tfidf_svd_df = pd.DataFrame(
        test_tfidf_svd,
        columns=tfidf_svd_feature_names,
        index=test_df.index
    )
    
    return train_tfidf_svd_df, test_tfidf_svd_df, vectorizer, svd

# Generate TF-IDF and SVD features
train_tfidf_svd_df, test_tfidf_svd_df, tfidf_vectorizer, tfidf_svd = generate_tfidf_svd_features(
    train_encoded, test_encoded, text_col='all_text', use_cache=False
)

# Display shape of TF-IDF SVD features
print(f"TF-IDF SVD features shape - Train: {train_tfidf_svd_df.shape}, Test: {test_tfidf_svd_df.shape}")
print("\nSample of TF-IDF SVD features:")
display(train_tfidf_svd_df.iloc[:3, :5])
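
The fit-on-train, transform-on-test pattern used above can be checked on a toy corpus. The documents below are invented and the settings are deliberately small; the sketch only shows that both splits end up in the same reduced feature space.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus for illustration
train_docs = [
    "green tea bags pack of 6",
    "black tea loose leaf",
    "steel water bottle",
    "glass water bottle with lid",
]
test_docs = ["green tea loose leaf", "steel bottle"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
train_tfidf = vectorizer.fit_transform(train_docs)  # vocabulary learned from train only
test_tfidf = vectorizer.transform(test_docs)        # unseen terms are ignored

svd = TruncatedSVD(n_components=2, random_state=42)
train_svd = svd.fit_transform(train_tfidf)
test_svd = svd.transform(test_tfidf)

print(train_svd.shape, test_svd.shape)
```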

In [None]:
# Try to use sentence-transformers for embeddings (with fallback to TF-IDF)
def generate_sentence_transformer_features(train_df, test_df, text_col='all_text', 
                                         model_name='all-MiniLM-L6-v2', 
                                         cache_dir=CACHE_DIR, use_cache=True,
                                         max_samples=None):
    """
    Generate sentence transformer embeddings for text
    Will fall back to TF-IDF if transformers not available
    """
    
    train_st_cache_path = os.path.join(cache_dir, f'train_{model_name.replace("-", "_")}_features.npz')
    test_st_cache_path = os.path.join(cache_dir, f'test_{model_name.replace("-", "_")}_features.npz')
    
    # Check if cached files exist and use_cache is True
    if use_cache and os.path.exists(train_st_cache_path) and os.path.exists(test_st_cache_path):
        print(f"Loading {model_name} embeddings from cache...")
        train_embeddings = scipy.sparse.load_npz(train_st_cache_path).toarray()
        test_embeddings = scipy.sparse.load_npz(test_st_cache_path).toarray()
        
        print(f"Loaded cached embeddings with {train_embeddings.shape[1]} dimensions")
        
    else:
        # Try to import sentence_transformers
        try:
            from sentence_transformers import SentenceTransformer
            
            print(f"Loading {model_name} model...")
            model = SentenceTransformer(model_name)
            
            # Sample data if requested (to save memory)
            if max_samples and len(train_df) > max_samples:
                print(f"Sampling {max_samples} out of {len(train_df)} examples for embedding generation...")
                train_sample_idx = np.random.choice(len(train_df), max_samples, replace=False)
                train_text = train_df.iloc[train_sample_idx][text_col].fillna('').values
            else:
                train_text = train_df[text_col].fillna('').values
                
            test_text = test_df[text_col].fillna('').values
            
            # Generate embeddings
            print("Generating embeddings for training data...")
            train_embeddings = model.encode(train_text, show_progress_bar=True, batch_size=32)
            
            print("Generating embeddings for test data...")
            test_embeddings = model.encode(test_text, show_progress_bar=True, batch_size=32)
            
            # If we sampled, we need to create embeddings for the rest of the data
            if max_samples and len(train_df) > max_samples:
                print("Processing remaining training samples...")
                # Vectorized set difference; avoids a slow per-element membership test
                remaining_idx = np.setdiff1d(np.arange(len(train_df)), train_sample_idx)
                batch_size = 1000
                
                # Process in batches to save memory
                full_train_embeddings = np.zeros((len(train_df), train_embeddings.shape[1]))
                full_train_embeddings[train_sample_idx] = train_embeddings
                
                for i in range(0, len(remaining_idx), batch_size):
                    batch_idx = remaining_idx[i:i+batch_size]
                    batch_text = train_df.iloc[batch_idx][text_col].fillna('').values
                    batch_embeddings = model.encode(batch_text, show_progress_bar=True, batch_size=32)
                    full_train_embeddings[batch_idx] = batch_embeddings
                    
                train_embeddings = full_train_embeddings
            
            # Cache the embeddings
            print("Caching sentence transformer embeddings...")
            scipy.sparse.save_npz(train_st_cache_path, scipy.sparse.csr_matrix(train_embeddings))
            scipy.sparse.save_npz(test_st_cache_path, scipy.sparse.csr_matrix(test_embeddings))
            
        except Exception as e:
            # Covers a missing sentence-transformers install (ImportError) as well
            # as runtime failures such as model download or memory errors
            print(f"Sentence transformer embeddings unavailable: {e}")
            print("Falling back to TF-IDF + SVD.")
            
            # Use TF-IDF as fallback
            tfidf = TfidfVectorizer(max_features=10000)
            train_text = train_df[text_col].fillna('').values
            test_text = test_df[text_col].fillna('').values
            
            train_tfidf = tfidf.fit_transform(train_text)
            test_tfidf = tfidf.transform(test_text)
            
            # Apply SVD to get embeddings; keep n_components below the vocabulary size
            n_components = min(384, train_tfidf.shape[1] - 1)
            svd = TruncatedSVD(n_components=n_components, random_state=RANDOM_SEED)
            train_embeddings = svd.fit_transform(train_tfidf)
            test_embeddings = svd.transform(test_tfidf)
            
            print("Generated TF-IDF + SVD fallback embeddings")
    
    # Create feature names
    embedding_feature_names = [f'st_emb_{i}' for i in range(train_embeddings.shape[1])]
    
    # Convert to DataFrame
    train_embedding_df = pd.DataFrame(
        train_embeddings, 
        columns=embedding_feature_names,
        index=train_df.index
    )
    
    test_embedding_df = pd.DataFrame(
        test_embeddings,
        columns=embedding_feature_names,
        index=test_df.index
    )
    
    return train_embedding_df, test_embedding_df

# Try to generate sentence transformer embeddings
try:
    train_st_df, test_st_df = generate_sentence_transformer_features(
        train_encoded, test_encoded, text_col='all_text', use_cache=False
    )
    
    # Display shape of sentence transformer embeddings
    print(f"Sentence transformer embeddings shape - Train: {train_st_df.shape}, Test: {test_st_df.shape}")
    print("\nSample of sentence transformer embeddings:")
    display(train_st_df.iloc[:3, :5])
    
    # Set flag indicating that transformer embeddings are available
    TRANSFORMER_EMBEDDINGS_AVAILABLE = True
    
except Exception as e:
    print(f"Could not generate sentence transformer embeddings: {e}")
    print("Will use TF-IDF SVD embeddings only.")
    TRANSFORMER_EMBEDDINGS_AVAILABLE = False
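
Sentence-transformer embeddings are dense, so storing them through a CSR matrix keeps every value plus its index arrays. A minimal sketch of an alternative cache based on `np.save`/`np.load` is shown below. The helper names `cache_embeddings` and `load_cached_embeddings` are hypothetical and not part of this notebook, and the notebook's own cache-loading code would need to read the same format.

```python
import os
import tempfile
import numpy as np

def cache_embeddings(path, embeddings):
    """Store a dense embedding matrix as float32 in .npy format."""
    np.save(path, np.asarray(embeddings, dtype=np.float32))

def load_cached_embeddings(path):
    """Return the cached matrix, or None when no cache file exists."""
    if not os.path.exists(path):
        return None
    return np.load(path)

# Round trip on toy data in a temporary directory
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 384)).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "train_st_embeddings.npy")
    cache_embeddings(cache_path, toy_embeddings)
    restored = load_cached_embeddings(cache_path)
    missing = load_cached_embeddings(os.path.join(tmp_dir, "absent.npy"))
```

Returning `None` for a missing file lets the caller decide whether to recompute, which mirrors the `use_cache` switch used above.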

In [None]:
# Combine all features for model training
def prepare_features_for_modeling(train_df, test_df, tfidf_svd_df_train, tfidf_svd_df_test, 
                                st_df_train=None, st_df_test=None, use_transformer=True):
    """
    Combine all features for model training
    Returns DataFrames with features for training and testing
    """
    
    # Start with the numerical features
    numerical_features = [
        'ipq', 'text_len', 'num_words', 'avg_word_len', 'num_digits',
        'num_upper', 'num_lower', 'upper_ratio', 'num_bullets'
    ]
    
    # Add encoded categorical features
    categorical_features = ['brand_encoded']
    
    # Combine all tabular features
    tabular_features = numerical_features + categorical_features
    
    # Select only features that exist in both train and test
    existing_tabular_features = [f for f in tabular_features 
                               if f in train_df.columns and f in test_df.columns]
    
    print(f"Using {len(existing_tabular_features)} tabular features: {existing_tabular_features}")
    
    # Start with tabular features
    train_features = train_df[existing_tabular_features].copy()
    test_features = test_df[existing_tabular_features].copy()
    
    # Add TF-IDF SVD features
    print("Adding TF-IDF SVD features...")
    train_features = pd.concat([train_features, tfidf_svd_df_train], axis=1)
    test_features = pd.concat([test_features, tfidf_svd_df_test], axis=1)
    
    # Add sentence transformer features if available and requested
    if use_transformer and st_df_train is not None and st_df_test is not None:
        print("Adding sentence transformer features...")
        train_features = pd.concat([train_features, st_df_train], axis=1)
        test_features = pd.concat([test_features, st_df_test], axis=1)
    
    print(f"Final feature shapes - Train: {train_features.shape}, Test: {test_features.shape}")
    
    return train_features, test_features

# Prepare feature sets with and without transformers
if TRANSFORMER_EMBEDDINGS_AVAILABLE:
    train_features_all, test_features_all = prepare_features_for_modeling(
        train_encoded, test_encoded, 
        tfidf_svd_df_train=train_tfidf_svd_df, 
        tfidf_svd_df_test=test_tfidf_svd_df,
        st_df_train=train_st_df, 
        st_df_test=test_st_df,
        use_transformer=True
    )
else:
    train_features_all, test_features_all = prepare_features_for_modeling(
        train_encoded, test_encoded, 
        tfidf_svd_df_train=train_tfidf_svd_df, 
        tfidf_svd_df_test=test_tfidf_svd_df,
        use_transformer=False
    )

# Also create a set without transformer features for comparison
train_features_base, test_features_base = prepare_features_for_modeling(
    train_encoded, test_encoded, 
    tfidf_svd_df_train=train_tfidf_svd_df, 
    tfidf_svd_df_test=test_tfidf_svd_df,
    use_transformer=False
)

# Display shapes
print("\nFeature set shapes:")
print(f"All features - Train: {train_features_all.shape}, Test: {test_features_all.shape}")
print(f"Base features - Train: {train_features_base.shape}, Test: {test_features_base.shape}")
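
`pd.concat(..., axis=1)` aligns rows by index label, so the tabular, TF-IDF SVD, and transformer frames must all carry the same index. If one of them was built with a fresh `RangeIndex`, the concat does not fail; it silently produces extra rows filled with NaN. A small sketch of a guard follows, where `concat_aligned` is a hypothetical helper and the toy frames are illustrative only.

```python
import pandas as pd

def concat_aligned(frames):
    """Column-wise concat that refuses frames whose indexes differ."""
    base_index = frames[0].index
    for position, frame in enumerate(frames[1:], start=1):
        if not frame.index.equals(base_index):
            raise ValueError(f"Frame {position} has a different index than frame 0")
    return pd.concat(frames, axis=1)

tabular = pd.DataFrame({"ipq": [1, 2, 3]}, index=[10, 11, 12])
svd_ok = pd.DataFrame({"svd_0": [0.1, 0.2, 0.3]}, index=[10, 11, 12])
svd_reset = pd.DataFrame({"svd_0": [0.1, 0.2, 0.3]})  # default RangeIndex 0..2

combined = concat_aligned([tabular, svd_ok])

# Plain pd.concat does not raise on a mismatch; it takes the union of both indexes
misaligned = pd.concat([tabular, svd_reset], axis=1)
```

Comparing the printed train and test shapes against the row counts of `train_encoded` and `test_encoded` is a quick way to catch the same problem in the cells above.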

## Image Processing & Feature Engineering (Optional)

In [None]:
# Image downloading and processing
def setup_image_processing(train_df, test_df, image_link_col='image_link',
                         download_images_func=download_images, 
                         max_images=None, timeout=60):
    """Set up image processing functionality with error handling"""
    
    # Check if torch and torchvision are available
    try:
        import torch
        import torchvision
        from torchvision import models, transforms
        from PIL import Image  # used by extract_image_features below
        
        print("PyTorch and TorchVision are available for image processing")
        TORCH_AVAILABLE = True
    except ImportError:
        print("PyTorch or TorchVision not available. Image features will not be used.")
        # Return two values to match the (functions, available) pair unpacked by the caller
        return None, False
    
    # Define image download function
    def download_product_images(df, output_dir, max_samples=None):
        """Download product images with retry and timeout"""
        if not os.path.exists(output_dir):
            os.makedirs(output_dir, exist_ok=True)
            
        if max_samples and len(df) > max_samples:
            print(f"Sampling {max_samples} out of {len(df)} images to download")
            df_sample = df.sample(max_samples, random_state=RANDOM_SEED)
        else:
            df_sample = df
        
        # Extract image links and sample IDs
        image_links = df_sample[image_link_col].values
        sample_ids = df_sample['sample_id'].values
        
        # Create a mapping of sample ID to image filename
        sample_id_to_filename = {}
        
        try:
            print(f"Downloading {len(image_links)} images to {output_dir}")
            download_images_func(image_links, output_dir)
            
            # Map sample IDs to downloaded filenames
            for i, (sample_id, image_link) in enumerate(zip(sample_ids, image_links)):
                if isinstance(image_link, str):
                    filename = Path(image_link).name
                    image_path = os.path.join(output_dir, filename)
                    if os.path.exists(image_path):
                        sample_id_to_filename[sample_id] = image_path
            
            print(f"Successfully downloaded {len(sample_id_to_filename)} images")
            return sample_id_to_filename
            
        except Exception as e:
            print(f"Error downloading images: {e}")
            return {}
    
    # Define feature extraction function
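    def extract_image_features(sample_id_to_filename, model_name='efficientnet_b0', cache_dir=CACHE_DIR):
        """Extract features from images using pretrained model"""
        import hashlib
        
        # Key the cache on the set of sample IDs so that train and test features
        # are not written to, or loaded from, the same file
        ids_key = hashlib.md5(
            ','.join(sorted(str(sample_id) for sample_id in sample_id_to_filename)).encode('utf-8')
        ).hexdigest()[:12]
        cache_path = os.path.join(cache_dir, f'image_features_{model_name}_{ids_key}.pkl')
        
        # Check if cache exists
        if os.path.exists(cache_path):
            print(f"Loading cached image features from {cache_path}")
            sample_id_to_features = joblib.load(cache_path)
            return sample_id_to_features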
        
        try:
            # Load pretrained model
            if model_name == 'efficientnet_b0':
                model = models.efficientnet_b0(pretrained=True)
            elif model_name == 'resnet18':
                model = models.resnet18(pretrained=True)
            else:
                print(f"Unknown model: {model_name}. Using efficientnet_b0 instead.")
                model = models.efficientnet_b0(pretrained=True)
            
            # Remove classification head
            if model_name.startswith('efficientnet'):
                model = torch.nn.Sequential(*(list(model.children())[:-1]))
            else:  # ResNet
                model = torch.nn.Sequential(*list(model.children())[:-1])
            
            # Set model to evaluation mode
            model.eval()
            
            # Move model to GPU if available
            if torch.cuda.is_available():
                model = model.cuda()
            
            # Define image transforms
            transform = transforms.Compose([
                transforms.Resize(256),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
            
            # Extract features for each image
            sample_id_to_features = {}
            for sample_id, image_path in tqdm(sample_id_to_filename.items(), 
                                             desc="Extracting image features"):
                try:
                    # Load and transform image
                    image = Image.open(image_path).convert('RGB')
                    image_tensor = transform(image).unsqueeze(0)
                    
                    # Move image to GPU if available
                    if torch.cuda.is_available():
                        image_tensor = image_tensor.cuda()
                    
                    # Extract features
                    with torch.no_grad():
                        features = model(image_tensor)
                    
                    # Convert to numpy array
                    if torch.cuda.is_available():
                        features = features.cpu()
                    features = features.squeeze().numpy()
                    
                    # Store features
                    sample_id_to_features[sample_id] = features
                    
                except Exception as e:
                    print(f"Error extracting features for image {image_path}: {e}")
            
            print(f"Extracted features for {len(sample_id_to_features)} images")
            
            # Cache the features
            joblib.dump(sample_id_to_features, cache_path)
            
            return sample_id_to_features
            
        except Exception as e:
            print(f"Error setting up image feature extraction: {e}")
            return {}
    
    # Create function to convert image features to DataFrame
    def image_features_to_dataframe(df, sample_id_to_features, prefix='img_'):
        """Convert image features to DataFrame"""
        # Create empty DataFrame for image features
        if not sample_id_to_features:
            print("No image features available")
            return pd.DataFrame(index=df.index)
        
        # Get feature dimensionality from the first feature
        first_feature = next(iter(sample_id_to_features.values()))
        n_dims = len(first_feature)
        feature_names = [f'{prefix}{i}' for i in range(n_dims)]
        
        # Initialize DataFrame with zeros
        image_features_df = pd.DataFrame(
            np.zeros((len(df), n_dims)),
            columns=feature_names,
            index=df.index
        )
        
        # Fill in available features
        for i, row in df.iterrows():
            sample_id = row['sample_id']
            if sample_id in sample_id_to_features:
                image_features_df.loc[i, feature_names] = sample_id_to_features[sample_id]
        
        return image_features_df
    
    return {
        'download_images': download_product_images,
        'extract_features': extract_image_features,
        'features_to_df': image_features_to_dataframe
    }, TORCH_AVAILABLE

# Check if image processing is available and set up functions
image_processing, torch_available_for_images = setup_image_processing(
    train_encoded, test_encoded, image_link_col='image_link'
)

# Set up image processing if available
if image_processing:
    print("Image processing is available. Will attempt to download and process images.")
    USE_IMAGES = True
else:
    print("Image processing is not available. Will proceed without image features.")
    USE_IMAGES = False

In [None]:
# Download and process images (optional, only if image processing is available)
if USE_IMAGES:
    # Create directories for images
    train_images_dir = os.path.join(OUTPUT_PATH, 'train_images')
    test_images_dir = os.path.join(OUTPUT_PATH, 'test_images')
    os.makedirs(train_images_dir, exist_ok=True)
    os.makedirs(test_images_dir, exist_ok=True)
    
    # Download a sample of images (to save time)
    max_train_images = 5000  # Limit to 5000 training images to save time
    max_test_images = 1000   # Limit to 1000 test images
    
    print("Downloading training images...")
    train_sample_id_to_filename = image_processing['download_images'](
        train_encoded, train_images_dir, max_samples=max_train_images
    )
    
    print("Downloading test images...")
    test_sample_id_to_filename = image_processing['download_images'](
        test_encoded, test_images_dir, max_samples=max_test_images
    )
    
    # Extract image features
    print("Extracting image features...")
    train_sample_id_to_features = image_processing['extract_features'](
        train_sample_id_to_filename, model_name='efficientnet_b0'
    )
    
    test_sample_id_to_features = image_processing['extract_features'](
        test_sample_id_to_filename, model_name='efficientnet_b0'
    )
    
    # Convert image features to DataFrame
    print("Converting image features to DataFrame...")
    train_image_features_df = image_processing['features_to_df'](
        train_encoded, train_sample_id_to_features
    )
    
    test_image_features_df = image_processing['features_to_df'](
        test_encoded, test_sample_id_to_features
    )
    
    # Add image features to the feature set
    print("Adding image features to feature set...")
    train_features_with_images = pd.concat([train_features_all, train_image_features_df], axis=1)
    test_features_with_images = pd.concat([test_features_all, test_image_features_df], axis=1)
    
    print(f"Features with images - Train: {train_features_with_images.shape}, Test: {test_features_with_images.shape}")
    
    # Image feature count
    image_feature_count = train_image_features_df.shape[1]
    print(f"Added {image_feature_count} image features")
else:
    print("Skipping image processing. Will proceed without image features.")
    train_features_with_images = train_features_all
    test_features_with_images = test_features_all
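
Because only a sample of images is downloaded, most rows end up with an all-zero image vector, which a model cannot tell apart from a real feature vector that happens to be near zero. The sketch below shows one possible vectorized alternative to the row-by-row fill in `image_features_to_dataframe`, with an extra indicator column. The function name `build_image_feature_frame` and the `img_available` column are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def build_image_feature_frame(df, sample_id_to_features, prefix="img_"):
    """Vectorized lookup of per-sample feature vectors, zero-filled where missing."""
    if not sample_id_to_features:
        return pd.DataFrame(index=df.index)

    # One row per sample_id that has features
    lookup = pd.DataFrame.from_dict(sample_id_to_features, orient="index")
    lookup.columns = [f"{prefix}{i}" for i in range(lookup.shape[1])]

    # Reorder to match df; samples without an image become all-NaN rows
    aligned = lookup.reindex(df["sample_id"].to_numpy())
    has_image = aligned.notna().all(axis=1).to_numpy()

    aligned = aligned.fillna(0.0)
    aligned.index = df.index
    aligned[f"{prefix}available"] = has_image.astype(int)
    return aligned

toy_df = pd.DataFrame({"sample_id": [101, 102, 103]}, index=[7, 8, 9])
toy_features = {101: np.array([0.5, 1.5]), 103: np.array([2.0, 3.0])}
image_frame = build_image_feature_frame(toy_df, toy_features)
```

The result keeps the index of `df`, so it can be concatenated column-wise with the other feature frames.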

## Model Training & Cross-Validation

In [None]:
# Define evaluation metric (SMAPE)
def smape(y_true, y_pred):
    """
    Symmetric Mean Absolute Percentage Error
    """
    # Convert to numpy arrays if they're not already
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    
    # Ensure no zeros or negative values (to avoid division by zero)
    y_true = np.maximum(y_true, 0.01)
    y_pred = np.maximum(y_pred, 0.01)
    
    # Calculate SMAPE
    return 100/len(y_true) * np.sum(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

# Prepare target variable (log-transformed price)
if 'price' in train_encoded.columns:
    print("Log-transforming price for training...")
    train_encoded['log_price'] = np.log1p(train_encoded['price'])
    
    # Display target distribution after transformation
    plt.figure(figsize=(10, 6))
    sns.histplot(train_encoded['log_price'], bins=50, kde=True)
    plt.title('Log-transformed Price Distribution')
    plt.xlabel('Log(Price + 1)')
    plt.ylabel('Frequency')
    plt.show()
    
    # Handle outliers in the target variable
    # Calculate percentile thresholds for capping
    upper_threshold = np.percentile(train_encoded['log_price'], 99.9)
    
    print(f"Capping log_price at the 99.9th percentile: {upper_threshold:.4f}")
    train_encoded['log_price_capped'] = np.minimum(train_encoded['log_price'], upper_threshold)
    
    # Use capped version for training
    y = train_encoded['log_price_capped']
    
    print(f"Target prepared. Shape: {y.shape}")
else:
    print("Price column not found in training data. Cannot proceed with model training.")
    y = None

# Function to create stratification bins for cross-validation
def create_strat_bins(y, n_bins=10):
    """Create bins for stratified cross-validation"""
    return pd.qcut(y, n_bins, labels=False, duplicates='drop')

# Set up cross-validation
def setup_cross_validation(X, y, n_splits=5, n_bins=10, random_state=RANDOM_SEED):
    """Set up stratified K-fold cross-validation"""
    if y is None:
        print("No target variable available. Cannot set up cross-validation.")
        return None
    
    # Create bins for stratification
    bins = create_strat_bins(y, n_bins)
    
    # Set up K-fold cross-validation
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    # Generate fold indices
    fold_indices = []
    for train_idx, valid_idx in kf.split(X, bins):
        fold_indices.append((train_idx, valid_idx))
    
    return fold_indices

# Set up cross-validation folds
cv_folds = setup_cross_validation(train_features_all, y, n_splits=5, n_bins=10)

if cv_folds:
    print(f"Cross-validation setup complete with {len(cv_folds)} folds.")
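
To make the metric and the target transform concrete, here is a small worked example. It restates the SMAPE formula from the cell above under the hypothetical name `smape_percent` and checks the `log1p`/`expm1` round trip on made-up prices; none of the numbers come from the competition data.

```python
import numpy as np

def smape_percent(y_true, y_pred, floor=0.01):
    """SMAPE in percent, with both inputs floored as in the notebook's smape()."""
    y_true = np.maximum(np.asarray(y_true, dtype=float), floor)
    y_pred = np.maximum(np.asarray(y_pred, dtype=float), floor)
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) / (y_true + y_pred))

# Over- and under-predicting by the same factor gives the same penalty
over = smape_percent([100.0], [200.0])   # 2 * 100 / 300
under = smape_percent([100.0], [50.0])   # 2 * 50 / 150

# SMAPE is bounded by 200%, approached when one side sits at the floor
worst = smape_percent([100.0], [0.0])

# Training on log1p(price) and inverting with expm1 recovers the prices
prices = np.array([0.5, 9.99, 250.0])
round_trip = np.expm1(np.log1p(prices))
```

Since the metric depends on the ratio between prediction and truth, fitting in log space is a natural match: an additive error on `log_price` corresponds to a multiplicative error on `price`.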

In [None]:
# Train models with cross-validation
def train_model_with_cv(X, y, model_class, model_params, folds, 
                      feature_type='all', model_name='model', 
                      cache_dir=CACHE_DIR):
    """
    Train model with cross-validation
    Returns a dict with the fold models, OOF predictions, fold scores,
    and mean score, or None if no target is available
    """
    if y is None:
        print("No target variable available. Cannot train model.")
        return None
    
    # Initialize arrays for OOF predictions
    oof_preds = np.zeros(len(X))
    fold_scores = []
    models = []
    
    print(f"\nTraining {model_name} with {feature_type} features")
    print(f"Feature shape: {X.shape}")
    
    # Loop through folds
    for fold, (train_idx, valid_idx) in enumerate(folds):
        print(f"Fold {fold+1}/{len(folds)}")
        
        # Split data
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        
        # Initialize and fit model
        model = model_class(**model_params)
        
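        # Special handling for LightGBM
        if isinstance(model, lgb.LGBMRegressor):
            # Early stopping and logging are passed as callbacks; the
            # early_stopping_rounds/verbose fit arguments were removed in LightGBM 4.0
            model.fit(
                X_train, y_train,
                eval_set=[(X_valid, y_valid)],
                eval_metric='rmse',
                callbacks=[
                    lgb.early_stopping(stopping_rounds=50),
                    lgb.log_evaluation(period=100)
                ]
            )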
        else:
            model.fit(X_train, y_train)
        
        # Make predictions on validation set
        valid_preds = model.predict(X_valid)
        
        # Store OOF predictions
        oof_preds[valid_idx] = valid_preds
        
        # Transform predictions back to original scale
        valid_preds_original = np.expm1(valid_preds)
        y_valid_original = np.expm1(y_valid)
        
        # Calculate SMAPE
        fold_smape = smape(y_valid_original, valid_preds_original)
        fold_scores.append(fold_smape)
        
        print(f"Fold {fold+1} SMAPE: {fold_smape:.4f}")
        
        # Store model
        models.append(model)
    
    # Calculate overall score
    mean_score = np.mean(fold_scores)
    print(f"Mean SMAPE across {len(folds)} folds: {mean_score:.4f}")
    
    # Save models
    model_path = os.path.join(cache_dir, f"{model_name}_{feature_type}_models.pkl")
    joblib.dump(models, model_path)
    
    # Return results
    return {
        'models': models,
        'oof_preds': oof_preds,
        'fold_scores': fold_scores,
        'mean_score': mean_score
    }

# Define models to train
models_to_train = [
    {
        'name': 'ridge',
        'class': Ridge,
        'params': {'alpha': 1.0, 'random_state': RANDOM_SEED}
    },
    {
        'name': 'lightgbm',
        'class': lgb.LGBMRegressor,
        'params': {
            'n_estimators': 1000,
            'learning_rate': 0.05,
            'num_leaves': 31,
            'colsample_bytree': 0.8,
            'subsample': 0.8,
            'subsample_freq': 1,  # LightGBM only applies subsample when subsample_freq > 0
            'reg_alpha': 0.1,
            'reg_lambda': 0.1,
            'n_jobs': -1,
            'random_state': RANDOM_SEED
        }
    }
]

# Train models with cross-validation
model_results = {}
for model_config in models_to_train:
    model_name = model_config['name']
    model_class = model_config['class']
    model_params = model_config['params']
    
    # Train on base features (tabular + TF-IDF SVD, no transformer embeddings)
    base_result = train_model_with_cv(
        train_features_base, y, model_class, model_params, cv_folds,
        feature_type='base', model_name=model_name
    )
    model_results[f'{model_name}_base'] = base_result
    
    # Train on all features (including transformer embeddings if available)
    all_result = train_model_with_cv(
        train_features_all, y, model_class, model_params, cv_folds,
        feature_type='all', model_name=model_name
    )
    model_results[f'{model_name}_all'] = all_result
    
    # Train on all features + images if image features are available
    if USE_IMAGES:
        img_result = train_model_with_cv(
            train_features_with_images, y, model_class, model_params, cv_folds,
            feature_type='with_images', model_name=model_name
        )
        model_results[f'{model_name}_with_images'] = img_result

# Summarize model results
print("\nModel performance summary:")
for model_name, result in model_results.items():
    if result:  # Check that result is not None
        print(f"{model_name}: Mean SMAPE = {result['mean_score']:.4f}, Fold SMAPEs = {[f'{score:.4f}' for score in result['fold_scores']]}")

# Identify best model based on mean score
best_model_name = min(model_results.keys(), key=lambda k: model_results[k]['mean_score'] if model_results[k] else float('inf'))
best_model_result = model_results[best_model_name]
print(f"\nBest model: {best_model_name} with Mean SMAPE = {best_model_result['mean_score']:.4f}")
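
The out-of-fold (OOF) predictions collected by `train_model_with_cv` are what the ensembling step later consumes, so it helps to see the idea in isolation. The sketch below uses a closed-form ridge fit in NumPy as a stand-in for the `Ridge` and `LGBMRegressor` models, and a plain shuffled split instead of the stratified folds above. All names and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge weights (no intercept): (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def out_of_fold_predictions(X, y, n_splits=5, alpha=1.0, seed=0):
    """Predict every row with a model that never saw that row during fitting."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))
    folds = np.array_split(shuffled, n_splits)

    oof = np.full(len(X), np.nan)
    for valid_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[valid_idx] = False
        weights = ridge_fit(X[train_mask], y[train_mask], alpha=alpha)
        oof[valid_idx] = X[valid_idx] @ weights
    return oof

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
oof_toy = out_of_fold_predictions(X_toy, y_toy)
oof_rmse = float(np.sqrt(np.mean((oof_toy - y_toy) ** 2)))
```

Every entry of `oof_toy` comes from a model fit without that row, which is why OOF scores are a fairer estimate of test performance than scores on the training data.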

## Ensemble & Stacking

In [None]:
# Create stacking ensemble
def create_stacking_ensemble(base_models_results, X_train, X_test, y, folds, 
                          meta_model=Ridge(alpha=0.5), 
                          model_name='stacking_ensemble'):
    """
    Create a stacking ensemble from base models
    Returns a dict with the meta-models, OOF predictions, test predictions,
    fold scores, and mean score, or None if no target is available
    """
    if y is None:
        print("No target variable available. Cannot create ensemble.")
        return None
    
    print(f"\nCreating stacking ensemble with {len(base_models_results)} base models")
    
    # Extract OOF predictions from base models
    oof_preds_dict = {name: result['oof_preds'] for name, result in base_models_results.items() if result}
    
    # Create a DataFrame of OOF predictions
    oof_preds_df = pd.DataFrame(oof_preds_dict)
    
    # Initialize arrays for meta-model OOF predictions
    meta_oof_preds = np.zeros(len(X_train))
    fold_scores = []
    meta_models = []
    
    # Loop through folds
    for fold, (train_idx, valid_idx) in enumerate(folds):
        print(f"Training meta-model on fold {fold+1}/{len(folds)}")
        
        # Split data
        meta_X_train = oof_preds_df.iloc[train_idx]
        meta_X_valid = oof_preds_df.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        
        # Train meta-model
        meta_model_fold = clone(meta_model)
        meta_model_fold.fit(meta_X_train, y_train)
        
        # Make predictions on validation set
        valid_preds = meta_model_fold.predict(meta_X_valid)
        
        # Store OOF predictions
        meta_oof_preds[valid_idx] = valid_preds
        
        # Transform predictions back to original scale
        valid_preds_original = np.expm1(valid_preds)
        y_valid_original = np.expm1(y_valid)
        
        # Calculate SMAPE
        fold_smape = smape(y_valid_original, valid_preds_original)
        fold_scores.append(fold_smape)
        
        print(f"Meta-model fold {fold+1} SMAPE: {fold_smape:.4f}")
        
        # Store model
        meta_models.append(meta_model_fold)
    
    # Calculate overall score
    mean_score = np.mean(fold_scores)
    print(f"Meta-model mean SMAPE across {len(folds)} folds: {mean_score:.4f}")
    
    # Generate test predictions for each base model
    test_preds_dict = {}
    for name, result in base_models_results.items():
        if not result:
            continue
        
        # Prefer precomputed test predictions: each base model must be scored on the
        # feature set it was trained on, which may differ from X_test
        if result.get('test_preds') is not None:
            test_preds_dict[name] = result['test_preds']
            continue
        
        # Otherwise average the fold models' predictions on X_test
        test_preds_list = [fold_model.predict(X_test) for fold_model in result['models']]
        test_preds_dict[name] = np.mean(test_preds_list, axis=0)
    
    # Create DataFrame of test predictions
    test_preds_df = pd.DataFrame(test_preds_dict)
    
    # Make meta-model predictions on test data
    meta_test_preds_list = []
    for meta_model_fold in meta_models:
        meta_test_preds_fold = meta_model_fold.predict(test_preds_df)
        meta_test_preds_list.append(meta_test_preds_fold)
    
    # Average meta-model predictions across folds
    meta_test_preds = np.mean(meta_test_preds_list, axis=0)
    
    # Save meta-model
    meta_model_path = os.path.join(CACHE_DIR, f"{model_name}_meta_models.pkl")
    joblib.dump(meta_models, meta_model_path)
    
    # Return results
    return {
        'meta_models': meta_models,
        'oof_preds': meta_oof_preds,
        'test_preds': meta_test_preds,
        'fold_scores': fold_scores,
        'mean_score': mean_score
    }

# Create simple weighted ensemble
def create_weighted_ensemble(base_models_results, weights=None):
    """
    Create a weighted ensemble from base models
    Returns weighted predictions
    """
    print("\nCreating weighted ensemble")
    
    # Extract OOF predictions and scores from base models
    oof_preds_dict = {}
    test_preds_dict = {}
    scores_dict = {}
    
    for name, result in base_models_results.items():
        if result and 'oof_preds' in result and 'mean_score' in result:
            oof_preds_dict[name] = result['oof_preds']
            scores_dict[name] = result['mean_score']
            
            # Extract test predictions if available
            if 'test_preds' in result:
                test_preds_dict[name] = result['test_preds']
    
    # If no weights are provided, derive them from the scores;
    # weights passed in by the caller are used as given
    if weights is None:
        if scores_dict:
            # Convert scores to weights (lower score = higher weight)
            weights = {}
            for name, score in scores_dict.items():
                # Avoid division by zero
                if score > 0:
                    weights[name] = 1 / score
                else:
                    weights[name] = 1.0
            
            # Normalize weights to sum to 1
            total_weight = sum(weights.values())
            for name in weights:
                weights[name] /= total_weight
        else:
            # Use equal weights if no scores are available
            weights = {name: 1/len(base_models_results) for name in base_models_results}
    
    print("Ensemble weights:")
    for name, weight in weights.items():
        print(f"  {name}: {weight:.4f}")
    
    # Create weighted OOF predictions
    oof_preds_weighted = np.zeros(len(next(iter(oof_preds_dict.values()))))
    for name, preds in oof_preds_dict.items():
        oof_preds_weighted += weights[name] * preds
    
    # Create weighted test predictions if available
    if test_preds_dict:
        test_preds_weighted = np.zeros(len(next(iter(test_preds_dict.values()))))
        for name, preds in test_preds_dict.items():
            if name in weights:
                test_preds_weighted += weights[name] * preds
    else:
        test_preds_weighted = None
    
    return {
        'oof_preds': oof_preds_weighted,
        'test_preds': test_preds_weighted,
        'weights': weights
    }

# Prepare base models for ensembling
if y is not None:
    # Select models to include in ensembles
    base_models_for_stacking = {
        name: result for name, result in model_results.items()
        if result and 'oof_preds' in result
    }
    
    # Generate test predictions for every base model,
    # using the feature set that model was trained on
    for name, result in base_models_for_stacking.items():
        if 'models' in result:
            models = result['models']
            
            # Determine which feature set to use
            if 'with_images' in name and USE_IMAGES:
                X_test_features = test_features_with_images
            elif 'all' in name:
                X_test_features = test_features_all
            else:
                X_test_features = test_features_base
            
            # Make predictions on test data
            test_preds_list = []
            for fold_model in models:
                test_preds_fold = fold_model.predict(X_test_features)
                test_preds_list.append(test_preds_fold)
            
            # Average predictions across folds
            result['test_preds'] = np.mean(test_preds_list, axis=0)
            print(f"Generated test predictions for {name}")
    
    # Create stacking ensemble with Ridge as meta-model
    stacking_result = create_stacking_ensemble(
        base_models_for_stacking,
        train_features_base,  # Only its length is used; the meta-model is fit on OOF predictions
        test_features_base,
        y,
        cv_folds,
        meta_model=Ridge(alpha=0.5),
        model_name='stacking_ensemble'
    )
    
    # Create weighted ensemble
    weighted_result = create_weighted_ensemble(base_models_for_stacking)
    
    # Determine which ensemble performed better on OOF data
    if stacking_result and weighted_result:
        # Transform OOF predictions back to original scale
        stacking_oof_original = np.expm1(stacking_result['oof_preds'])
        weighted_oof_original = np.expm1(weighted_result['oof_preds'])
        y_original = np.expm1(y)
        
        # Calculate SMAPE
        stacking_smape = smape(y_original, stacking_oof_original)
        weighted_smape = smape(y_original, weighted_oof_original)
        
        print(f"Stacking ensemble SMAPE: {stacking_smape:.4f}")
        print(f"Weighted ensemble SMAPE: {weighted_smape:.4f}")
        
        # Choose the better ensemble
        if stacking_smape < weighted_smape:
            print("Using stacking ensemble for final predictions")
            final_ensemble = 'stacking'
            final_test_preds = stacking_result['test_preds']
        else:
            print("Using weighted ensemble for final predictions")
            final_ensemble = 'weighted'
            final_test_preds = weighted_result['test_preds']
            
        # Save ensemble results
        ensemble_results = {
            'stacking': {
                'oof_preds': stacking_result['oof_preds'],
                'test_preds': stacking_result['test_preds'],
                'smape': stacking_smape
            },
            'weighted': {
                'oof_preds': weighted_result['oof_preds'],
                'test_preds': weighted_result['test_preds'],
                'smape': weighted_smape
            },
            'final_ensemble': final_ensemble
        }
        
        # Save ensemble results to disk
        joblib.dump(ensemble_results, os.path.join(CACHE_DIR, 'ensemble_results.pkl'))
        
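    else:
        print("Could not create ensembles. Using best base model instead.")
        # Define the values read by the metrics export so it does not fail on this path
        final_ensemble = best_model_name
        stacking_smape = weighted_smape = None
        final_test_preds = model_results[best_model_name]['test_preds']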
else:
    print("No target variable available. Cannot create ensembles.")
    final_test_preds = None
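
The cell above picks whichever ensemble has the lower out-of-fold SMAPE after mapping predictions back to the original price scale. As a minimal, dependency-free sketch of that selection rule (the `pick_best` helper and the toy numbers are assumptions for illustration; the `smape` function here mirrors the floored definition used in the generated script, not necessarily the notebook's own):

```python
import math

def smape(y_true, y_pred, floor=0.01):
    """SMAPE in percent; inputs are floored to avoid division by zero."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        t, p = max(t, floor), max(p, floor)
        total += 2.0 * abs(p - t) / (abs(t) + abs(p))
    return 100.0 * total / len(y_true)

def pick_best(y_log, candidates):
    """Score each candidate on the original price scale and return the best name.

    y_log and the candidate predictions are assumed to be log1p(price) values.
    """
    y_orig = [math.expm1(v) for v in y_log]
    scores = {
        name: smape(y_orig, [math.expm1(v) for v in preds])
        for name, preds in candidates.items()
    }
    return min(scores, key=scores.get), scores

# Toy out-of-fold predictions, invented for illustration only.
y_log = [math.log1p(p) for p in (10.0, 25.0, 100.0)]
candidates = {
    'stacking': [math.log1p(p) for p in (11.0, 24.0, 90.0)],
    'weighted': [math.log1p(p) for p in (15.0, 20.0, 60.0)],
}
best_name, scores = pick_best(y_log, candidates)
print(best_name, {k: round(v, 2) for k, v in scores.items()})
```

Scoring on the original scale matters because SMAPE is a relative error on prices; a small gap in log space can hide a large percentage error.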

## Prediction & Submission Generation
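
Because the models are trained on `log1p(price)`, the submission step has to invert that transform and then floor the result. The sketch below illustrates why the floor is needed; `to_price` is a hypothetical helper written with the standard library only, and it mirrors the `np.expm1` plus `np.maximum` steps in `generate_submission` rather than replacing them.

```python
import math

def to_price(log_pred, floor=0.01):
    """Invert log1p and floor the result so the price stays positive."""
    return max(math.expm1(log_pred), floor)

# A well-behaved prediction round-trips to the original price.
round_trip = to_price(math.log1p(19.99))

# A negative log-scale prediction would map to a negative price,
# so the floor replaces it with the minimum allowed value.
floored = to_price(-0.5)

print(round(round_trip, 2), floored)
```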

In [None]:
# Generate final predictions and create submission
def generate_submission(test_df, test_preds, output_path=OUTPUT_CSV_PATH):
    """Generate submission file with predictions"""
    if test_preds is None:
        print("No predictions available. Cannot create submission file.")
        return None
    
    # Convert log predictions back to original scale
    test_preds_original = np.expm1(test_preds)
    
    # Clip to reasonable range (minimum 0.01)
    test_preds_clipped = np.maximum(test_preds_original, 0.01)
    
    # Create submission DataFrame
    submission = pd.DataFrame({
        'sample_id': test_df['sample_id'],
        'price': test_preds_clipped
    })
    
    # Save to CSV
    print(f"Saving submission to {output_path}")
    submission.to_csv(output_path, index=False)
    
    # Print submission statistics
    print("\nSubmission statistics:")
    print(f"Number of rows: {len(submission)}")
    print(f"Min price: {submission['price'].min():.4f}")
    print(f"Max price: {submission['price'].max():.4f}")
    print(f"Mean price: {submission['price'].mean():.4f}")
    print(f"Median price: {submission['price'].median():.4f}")
    
    return submission

# Generate and save submission file
if final_test_preds is not None:
    submission = generate_submission(test_encoded, final_test_preds)
    
    # Save metrics to JSON
    if y is not None:
        metrics = {
            'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            'model_results': {name: {'mean_smape': result['mean_score'], 
                                   'fold_smapes': result['fold_scores']} 
                           for name, result in model_results.items() if result},
            # Ensemble scores exist only if both ensembles were built above;
            # fall back to None instead of raising NameError.
            'ensemble_results': {
                'stacking_smape': globals().get('stacking_smape'),
                'weighted_smape': globals().get('weighted_smape'),
                'final_ensemble': globals().get('final_ensemble')
            }
        }
        
        # Save to JSON
        with open(METRICS_PATH, 'w') as f:
            json.dump(metrics, f, indent=2)
        
        print(f"Saved metrics to {METRICS_PATH}")
else:
    print("No predictions available. Cannot create submission file.")

# Print final performance
if y is not None:
    # Print OOF performance
    print("\n----- Final OOF Performance -----")
    print(f"Best base model ({best_model_name}): {model_results[best_model_name]['mean_score']:.4f}")
    
    # Print ensemble performance if available
    if 'stacking_smape' in locals() and 'weighted_smape' in locals():
        print(f"Stacking ensemble: {stacking_smape:.4f}")
        print(f"Weighted ensemble: {weighted_smape:.4f}")
        
    # Print final ensemble choice
    if 'final_ensemble' in locals():
        print(f"Final ensemble: {final_ensemble}")
        
    # Print per-fold performance of best model
    print("\nPer-fold SMAPE of best model:")
    for i, score in enumerate(model_results[best_model_name]['fold_scores']):
        print(f"Fold {i+1}: {score:.4f}")
    
    # Print submission file path (only if a submission was written)
    if final_test_preds is not None:
        print(f"\nSubmission file: {OUTPUT_CSV_PATH}")
else:
    print("No target variable available. Cannot evaluate performance.")

## Quick Baseline (If You're in a Hurry)
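
The quick baseline chains three steps: TF-IDF vectorization, truncated SVD, and Ridge regression on the log-transformed price. As a rough sketch of that chain on made-up data (the catalog strings, prices, and `n_components=2` are assumptions chosen so the example runs on six rows; the sketch also omits the `text_len` feature that the notebook appends), it can be written as a scikit-learn pipeline:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy catalog text and prices, invented for illustration only.
texts = [
    "item name: steel water bottle pack of 2",
    "item name: cotton bath towel set of 4",
    "item name: wireless optical mouse",
    "item name: ceramic coffee mug pack of 6",
    "item name: stainless steel frying pan",
    "item name: organic green tea 100 count",
]
prices = np.array([18.0, 25.0, 15.0, 30.0, 40.0, 12.0])

# TF-IDF -> SVD -> Ridge, fitted on log1p(price).
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    TruncatedSVD(n_components=2, random_state=42),
    Ridge(alpha=1.0),
)
baseline.fit(texts, np.log1p(prices))

# Predict on unseen text, invert the log transform, and floor at 0.01.
new_items = ["item name: steel coffee mug", "item name: bath towel pack of 2"]
predicted_prices = np.maximum(np.expm1(baseline.predict(new_items)), 0.01)
print(predicted_prices)
```

Wrapping the steps in one pipeline keeps the vectorizer and SVD fitted on training text only, which is the same fit/transform split the baseline function applies by hand.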

In [None]:
# Quick Baseline - TF-IDF + Ridge Regression
def run_quick_baseline():
    """
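    Run a quick baseline: TF-IDF features reduced with truncated SVD, plus
    text length, fed to Ridge regression on log1p(price).
    Intended to finish in under 10 minutes; the elapsed time is printed at the end.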
    """
    print("Running quick baseline model (TF-IDF + Ridge)")
    
    # Start timer
    start_time = time.time()
    
    # Load data
    try:
        train = pd.read_csv(TRAIN_PATH)
        test = pd.read_csv(TEST_PATH)
        print(f"Train data shape: {train.shape}")
        print(f"Test data shape: {test.shape}")
    except FileNotFoundError as e:
        print(f"Error loading data: {e}")
        print("Trying to load sample test data instead...")
        try:
            train = pd.read_csv(SAMPLE_TEST_PATH)
            sample_test_out = pd.read_csv(SAMPLE_TEST_OUT_PATH)
            train = pd.merge(train, sample_test_out, on='sample_id', how='inner')
            test = train.copy()  # For demonstration purposes only
            print(f"Sample test data shape: {train.shape}")
        except FileNotFoundError as e2:
            print(f"Error loading sample data: {e2}")
            print("Cannot proceed without data.")
            return
    
    # Verify price column exists
    if 'price' not in train.columns:
        print("Price column not found in training data.")
        return
    
    # Basic cleaning of catalog_content
    print("Cleaning text...")
    train['clean_text'] = train['catalog_content'].fillna('').str.lower()
    test['clean_text'] = test['catalog_content'].fillna('').str.lower()
    
    # Extract basic features
    print("Extracting basic features...")
    train['text_len'] = train['clean_text'].str.len()
    test['text_len'] = test['clean_text'].str.len()
    
    # Generate TF-IDF features
    print("Generating TF-IDF features...")
    tfidf = TfidfVectorizer(
        max_features=20000,  # Limit features for speed
        min_df=3,
        max_df=0.95,
        ngram_range=(1, 2)  # Unigrams and bigrams
    )
    
    # clean_text was already filled with '' during cleaning above
    train_text = train['clean_text'].values
    test_text = test['clean_text'].values
    
    train_tfidf = tfidf.fit_transform(train_text)
    test_tfidf = tfidf.transform(test_text)
    
    print(f"TF-IDF features shape - Train: {train_tfidf.shape}, Test: {test_tfidf.shape}")
    
    # Apply SVD for dimensionality reduction
    print("Applying SVD...")
    n_components = 100  # Smaller for speed
    svd = TruncatedSVD(n_components=n_components, random_state=RANDOM_SEED)
    train_tfidf_svd = svd.fit_transform(train_tfidf)
    test_tfidf_svd = svd.transform(test_tfidf)
    
    print(f"SVD features shape - Train: {train_tfidf_svd.shape}, Test: {test_tfidf_svd.shape}")
    
    # Add text_len feature
    train_features = np.hstack([train_tfidf_svd, train['text_len'].values.reshape(-1, 1)])
    test_features = np.hstack([test_tfidf_svd, test['text_len'].values.reshape(-1, 1)])
    
    # Log-transform the target
    y = np.log1p(train['price'])
    
    # Train Ridge model
    print("Training Ridge model...")
    ridge = Ridge(alpha=1.0, random_state=RANDOM_SEED)
    ridge.fit(train_features, y)
    
    # Make predictions
    print("Making predictions...")
    test_preds = ridge.predict(test_features)
    
    # Convert back to original scale
    test_preds_original = np.expm1(test_preds)
    
    # Clip to reasonable range (minimum 0.01)
    test_preds_clipped = np.maximum(test_preds_original, 0.01)
    
    # Create submission DataFrame
    submission = pd.DataFrame({
        'sample_id': test['sample_id'],
        'price': test_preds_clipped
    })
    
    # Save to CSV
    output_path = os.path.join(OUTPUT_PATH, 'quick_baseline_submission.csv')
    submission.to_csv(output_path, index=False)
    
    # Calculate elapsed time
    elapsed_time = time.time() - start_time
    print(f"Quick baseline completed in {elapsed_time:.2f} seconds ({elapsed_time/60:.2f} minutes)")
    
    # Print submission statistics
    print("\nSubmission statistics:")
    print(f"Number of rows: {len(submission)}")
    print(f"Min price: {submission['price'].min():.4f}")
    print(f"Max price: {submission['price'].max():.4f}")
    print(f"Mean price: {submission['price'].mean():.4f}")
    print(f"Median price: {submission['price'].median():.4f}")
    
    print(f"\nSubmission file: {output_path}")
    
    return submission

# Uncomment the line below to run the quick baseline
# quick_baseline_submission = run_quick_baseline()

## Utility Functions
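
The script generator below stores the whole script in a regular (non-raw) triple-quoted string, so every backslash meant for the generated file has to be written twice in the template. The following sketch shows the effect on a tiny stand-in template; the template content and the `PACK_PATTERN` name are hypothetical, and `exec` is used only to keep the demonstration short.

```python
import re

# A tiny stand-in for the embedded script (hypothetical content).
script_content = '''import re
PACK_PATTERN = r'pack of (\\d+)'
'''

# The doubled backslash is stored as a single one, so the generated
# source contains the regex exactly as Python should read it.
assert "r'pack of (\\d+)'" in script_content

# Execute the generated source in an isolated namespace and use the pattern.
namespace = {}
exec(script_content, namespace)
match = re.search(namespace['PACK_PATTERN'], 'pack of 12 pens')
pack_quantity = int(match.group(1))
print(pack_quantity)
```

By contrast, an escape such as `\n` written with a single backslash inside the template becomes a real newline in the generated file, which is why the embedded script writes `'\\n'` wherever the generated code needs the two-character escape.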

In [None]:
# Define utility functions that can be reused across the notebook

# Function to generate a command-line runnable script
def generate_train_predict_script(output_path=None):
    """Generate a command-line runnable script for training and prediction"""
    if output_path is None:
        output_path = os.path.join(OUTPUT_PATH, 'train_predict.py')
    
    script_content = '''#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Train and predict script for Smart Product Pricing Challenge
"""

import os
import sys
import re
import argparse
import numpy as np
import pandas as pd
import pickle
import joblib
from pathlib import Path
import warnings
import time
from functools import partial
import multiprocessing
from datetime import datetime
import json

# Suppress warnings
warnings.filterwarnings('ignore')

# Machine learning imports
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb

# Set random seeds
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
import random
random.seed(RANDOM_SEED)

# Optional torch imports
try:
    import torch
    torch.manual_seed(RANDOM_SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(RANDOM_SEED)
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False


def setup_paths(base_path=None, output_path=None, dataset_path=None):
    """Set up paths for data and output"""
    if base_path is None:
        # Auto-detect Kaggle environment
        if os.path.exists('/kaggle/input'):
            base_path = '/kaggle/input'
        else:
            base_path = '.'
    
    if output_path is None:
        # Auto-detect Kaggle environment
        if os.path.exists('/kaggle/working'):
            output_path = '/kaggle/working'
        else:
            output_path = '.'
    
    if dataset_path is None:
        # Try to find dataset path
        if os.path.exists(os.path.join(base_path, 'train.csv')):
            dataset_path = base_path
        else:
            # Look in common locations
            possible_paths = [
                os.path.join(base_path, 'dataset'),
                os.path.join(base_path, 'data'),
                os.path.join(base_path, 'smart-product-pricing-challenge'),
                os.path.join(base_path, 'student_resource', 'dataset')
            ]
            for path in possible_paths:
                if os.path.exists(os.path.join(path, 'train.csv')):
                    dataset_path = path
                    break
    
    # Create output directory
    os.makedirs(output_path, exist_ok=True)
    
    # Create cache directory
    cache_dir = os.path.join(output_path, 'cache')
    os.makedirs(cache_dir, exist_ok=True)
    
    return {
        'base_path': base_path,
        'output_path': output_path,
        'dataset_path': dataset_path,
        'cache_dir': cache_dir,
        'train_path': os.path.join(dataset_path, 'train.csv') if dataset_path else None,
        'test_path': os.path.join(dataset_path, 'test.csv') if dataset_path else None,
        'output_csv_path': os.path.join(output_path, 'test_out.csv'),
        'metrics_path': os.path.join(output_path, 'oof_metrics.json')
    }


def clean_text(text):
    """Clean text by removing URLs, special characters, and converting to lowercase"""
    if not isinstance(text, str):
        return ""
        
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs (the http alternative also covers https links)
    text = re.sub(r'http\\S+|www\\S+', '', text)
    
    # Replace punctuation and other non-word characters with spaces
    text = re.sub(r'[^\\w\\s]', ' ', text)
    
    # Remove extra spaces
    text = re.sub(r'\\s+', ' ', text).strip()
    
    return text


def extract_ipq(text):
    """Extract Item Pack Quantity (IPQ) from text"""
    if not isinstance(text, str):
        return 1
    
    text = text.lower()
    
    # Look for specific patterns indicating pack quantity
    patterns = [
        r'pack of (\\d+)',
        r'(\\d+)[-\\s]pack',
        r'(\\d+)\\s*pcs',
        r'(\\d+)\\s*pieces',
        r'(\\d+)\\s*count',
        r'(\\d+)\\s*ct',
        r'(\\d+)\\s*pk',
        r'set of (\\d+)',
        r'(\\d+)\\s*set',
        r'(\\d+)\\s*qty',
        r'quantity:\\s*(\\d+)',
        r'qty:\\s*(\\d+)',
        r'count:\\s*(\\d+)',
        r'value:\\s*(\\d+)',
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            try:
                quantity = int(match.group(1))
                return max(1, min(quantity, 100))  # Cap at reasonable values
            except ValueError:
                pass
    
    # Check for 'Value: X' pattern which often indicates quantity
    value_match = re.search(r'value:\\s*([\\d\\.]+)', text)
    if value_match:
        try:
            value = float(value_match.group(1))
            if 1 <= value <= 100:
                return int(value)
        except ValueError:
            pass
            
    # Default to 1 if no pattern is found
    return 1


def extract_brand(text):
    """Extract brand name from text using heuristics"""
    if not isinstance(text, str):
        return "Unknown"
    
    # Look for common brand patterns
    # Word boundaries keep "by" and "from" from matching inside longer words such as "baby"
    brand_patterns = [
        r'brand:\\s*([A-Za-z0-9][A-Za-z0-9\\s&\\-]+)',
        r'\\bby\\s+([A-Z][A-Za-z0-9\\s&\\-]+)',
        r'\\bfrom\\s+([A-Z][A-Za-z0-9\\s&\\-]+)',
        r'item name:\\s*([A-Z][A-Za-z0-9\\s&\\-]+)'
    ]
    
    for pattern in brand_patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            brand = match.group(1).strip()
            # Accept only plausible brand lengths (2 to 29 characters)
            if 1 < len(brand) < 30:
                return brand
    
    # Try the first word of Item Name if it starts with an uppercase letter
    item_name_match = re.search(r'item name:([^,\\n]+)', text, re.IGNORECASE)
    if item_name_match:
        item_name = item_name_match.group(1).strip()
        first_word = item_name.split()[0] if item_name.split() else ""
        if first_word and first_word[0].isupper() and len(first_word) > 1:
            return first_word
    
    # Try the first word if it's all caps or its first letter is capitalized
    words = text.split()
    if words and len(words[0]) > 1:
        if words[0].isupper() or words[0][0].isupper():
            return words[0]
    
    return "Unknown"


def extract_title(text):
    """Extract title from catalog content"""
    if not isinstance(text, str):
        return ""
    
    # Try to find item name pattern
    item_name_match = re.search(r'item name:(.*?)(?:bullet point|product description|$)', 
                               text, re.IGNORECASE | re.DOTALL)
    
    if item_name_match:
        title = item_name_match.group(1).strip()
        return title
    
    # If no specific pattern, take the first line, or the first 100 characters
    # when the first line is empty
    first_line = text.split('\\n')[0].strip()
    if first_line:
        return first_line
    
    return text.strip()[:100]


def extract_description(text):
    """Extract product description from catalog content"""
    if not isinstance(text, str):
        return ""
    
    # Try to find product description pattern
    desc_match = re.search(r'product description:(.*?)(?:value:|unit:|$)', 
                           text, re.IGNORECASE | re.DOTALL)
    
    if desc_match:
        description = desc_match.group(1).strip()
        return description
    
    # If no specific pattern, take everything after the first line
    lines = text.split('\\n')
    if len(lines) > 1:
        return ' '.join(lines[1:]).strip()
    
    return ""


def extract_basic_features(text):
    """Extract basic text features like length, word count, etc."""
    if not isinstance(text, str):
        text = ""
    
    features = {}
    
    # Text length
    features['text_len'] = len(text)
    
    # Number of words
    words = text.split()
    features['num_words'] = len(words)
    
    # Average word length
    if features['num_words'] > 0:
        features['avg_word_len'] = sum(len(word) for word in words) / features['num_words']
    else:
        features['avg_word_len'] = 0
    
    # Number of digits
    features['num_digits'] = sum(c.isdigit() for c in text)
    
    # Number of uppercase letters
    features['num_upper'] = sum(c.isupper() for c in text)
    
    # Number of lowercase letters
    features['num_lower'] = sum(c.islower() for c in text)
    
    return features


def process_catalog_content(df):
    """Process catalog content and extract features"""
    if 'catalog_content' not in df.columns:
        print("Warning: catalog_content not found in dataframe")
        return df
    
    print("Processing catalog content...")
    
    # Create copies of the features to avoid modifying the original
    df_processed = df.copy()
    
    # Extract text components
    df_processed['title'] = df_processed['catalog_content'].apply(extract_title)
    df_processed['description'] = df_processed['catalog_content'].apply(extract_description)
    
    # Clean text fields
    df_processed['clean_title'] = df_processed['title'].apply(clean_text)
    df_processed['clean_description'] = df_processed['description'].apply(clean_text)
    
    # Combine all cleaned text for a single text feature
    df_processed['all_text'] = df_processed['clean_title'] + ' ' + df_processed['clean_description']
    
    # Extract IPQ and brand
    df_processed['ipq'] = df_processed['catalog_content'].apply(extract_ipq)
    df_processed['brand'] = df_processed['catalog_content'].apply(extract_brand)
    
    # Extract basic text features
    basic_features = df_processed['all_text'].apply(extract_basic_features)
    
    # Convert dictionary of features to columns
    for feature in ['text_len', 'num_words', 'avg_word_len', 'num_digits', 
                   'num_upper', 'num_lower']:
        df_processed[feature] = basic_features.apply(lambda x: x.get(feature, 0))
    
    return df_processed


def encode_categorical_features(train_df, test_df, categorical_cols=('brand',)):
    """Encode categorical features using label encoding with Unknown handling"""
    encoders = {}
    train_df_encoded = train_df.copy()
    test_df_encoded = test_df.copy()
    
    for col in categorical_cols:
        if col in train_df.columns and col in test_df.columns:
            print(f"Encoding {col}...")
            
            # Initialize LabelEncoder
            encoder = LabelEncoder()
            
            # Get all unique values from both train and test
            all_values = pd.concat([
                train_df[col].fillna('Unknown'),
                test_df[col].fillna('Unknown')
            ]).unique()
            
            # Make sure 'Unknown' is in the values
            if 'Unknown' not in all_values:
                all_values = np.append(all_values, 'Unknown')
                
            # Fit encoder on all values
            encoder.fit(all_values)
            
            # Transform train and test data
            train_df_encoded[f'{col}_encoded'] = encoder.transform(train_df[col].fillna('Unknown'))
            test_df_encoded[f'{col}_encoded'] = encoder.transform(test_df[col].fillna('Unknown'))
            
            # Store encoder for later use
            encoders[col] = encoder
    
    return train_df_encoded, test_df_encoded, encoders


def generate_tfidf_svd_features(train_df, test_df, text_col='all_text', cache_dir=None):
    """Generate TF-IDF features and apply SVD dimensionality reduction"""
    
    if cache_dir:
        tfidf_cache_path = os.path.join(cache_dir, 'tfidf_vectorizer.pkl')
        svd_cache_path = os.path.join(cache_dir, 'tfidf_svd.pkl')
        
        # Check if cached files exist
        if os.path.exists(tfidf_cache_path) and os.path.exists(svd_cache_path):
            print("Loading TF-IDF and SVD models from cache...")
            vectorizer = joblib.load(tfidf_cache_path)
            svd = joblib.load(svd_cache_path)
        else:
            vectorizer = None
            svd = None
    else:
        vectorizer = None
        svd = None
    
    if vectorizer is None:
        print("Generating TF-IDF features...")
        
        # Configure TF-IDF vectorizer
        vectorizer = TfidfVectorizer(
            max_features=40000,  # Limit vocabulary size
            min_df=3,            # Minimum document frequency
            max_df=0.95,         # Maximum document frequency
            ngram_range=(1, 2),  # Unigrams and bigrams
            lowercase=True,
            strip_accents='unicode',
            analyzer='word',
            token_pattern=r'\\w{1,}'  # Match words of at least length 1
        )
        
        # Fit on training data
        train_text = train_df[text_col].fillna('').values
        vectorizer.fit(train_text)
        
        # Cache the vectorizer
        if cache_dir:
            joblib.dump(vectorizer, tfidf_cache_path)
    
    # Transform train and test data
    train_text = train_df[text_col].fillna('').values
    test_text = test_df[text_col].fillna('').values
    
    print("Transforming text data with TF-IDF...")
    train_tfidf = vectorizer.transform(train_text)
    test_tfidf = vectorizer.transform(test_text)
    
    print(f"TF-IDF features shape - Train: {train_tfidf.shape}, Test: {test_tfidf.shape}")
    
    # Apply SVD for dimensionality reduction
    if svd is None:
        n_components = min(256, min(train_tfidf.shape[0], train_tfidf.shape[1]) - 1)
        print(f"Applying SVD to reduce dimensions to {n_components}...")
        
        svd = TruncatedSVD(n_components=n_components, random_state=RANDOM_SEED)
        svd.fit(train_tfidf)
        
        # Cache the SVD model
        if cache_dir:
            joblib.dump(svd, svd_cache_path)
    
    # Transform TF-IDF with SVD
    train_tfidf_svd = svd.transform(train_tfidf)
    test_tfidf_svd = svd.transform(test_tfidf)
    
    print(f"SVD features shape - Train: {train_tfidf_svd.shape}, Test: {test_tfidf_svd.shape}")
    
    # Create feature names
    tfidf_svd_feature_names = [f'tfidf_svd_{i}' for i in range(train_tfidf_svd.shape[1])]
    
    # Convert to DataFrame
    train_tfidf_svd_df = pd.DataFrame(
        train_tfidf_svd, 
        columns=tfidf_svd_feature_names,
        index=train_df.index
    )
    
    test_tfidf_svd_df = pd.DataFrame(
        test_tfidf_svd,
        columns=tfidf_svd_feature_names,
        index=test_df.index
    )
    
    return train_tfidf_svd_df, test_tfidf_svd_df, vectorizer, svd


def prepare_features_for_modeling(train_df, test_df, tfidf_svd_df_train, tfidf_svd_df_test):
    """Combine all features for model training"""
    
    # Start with the numerical features
    numerical_features = [
        'ipq', 'text_len', 'num_words', 'avg_word_len', 'num_digits',
        'num_upper', 'num_lower'
    ]
    
    # Add encoded categorical features
    categorical_features = ['brand_encoded']
    
    # Combine all tabular features
    tabular_features = numerical_features + categorical_features
    
    # Select only features that exist in both train and test
    existing_tabular_features = [f for f in tabular_features 
                               if f in train_df.columns and f in test_df.columns]
    
    print(f"Using {len(existing_tabular_features)} tabular features")
    
    # Start with tabular features
    train_features = train_df[existing_tabular_features].copy()
    test_features = test_df[existing_tabular_features].copy()
    
    # Add TF-IDF SVD features
    print("Adding TF-IDF SVD features...")
    train_features = pd.concat([train_features, tfidf_svd_df_train], axis=1)
    test_features = pd.concat([test_features, tfidf_svd_df_test], axis=1)
    
    print(f"Final feature shapes - Train: {train_features.shape}, Test: {test_features.shape}")
    
    return train_features, test_features


def smape(y_true, y_pred):
    """Calculate Symmetric Mean Absolute Percentage Error"""
    # Convert to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    
    # Ensure no zeros (to avoid division by zero)
    y_true = np.maximum(y_true, 0.01)
    y_pred = np.maximum(y_pred, 0.01)
    
    # Calculate SMAPE
    return 100/len(y_true) * np.sum(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))


def create_strat_bins(y, n_bins=10):
    """Create bins for stratified cross-validation"""
    return pd.qcut(y, n_bins, labels=False, duplicates='drop')


def setup_cross_validation(X, y, n_splits=5, n_bins=10, random_state=RANDOM_SEED):
    """Set up stratified K-fold cross-validation"""
    # Create bins for stratification
    bins = create_strat_bins(y, n_bins)
    
    # Set up K-fold cross-validation
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    # Generate fold indices
    fold_indices = []
    for train_idx, valid_idx in kf.split(X, bins):
        fold_indices.append((train_idx, valid_idx))
    
    return fold_indices


def train_model_with_cv(X, y, model_class, model_params, folds, model_name='model', cache_dir=None):
    """Train model with cross-validation"""
    # Initialize arrays for OOF predictions
    oof_preds = np.zeros(len(X))
    fold_scores = []
    models = []
    
    print(f"\\nTraining {model_name}")
    
    # Loop through folds
    for fold, (train_idx, valid_idx) in enumerate(folds):
        print(f"Fold {fold+1}/{len(folds)}")
        
        # Split data
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        
        # Initialize and fit model
        model = model_class(**model_params)
        
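        # Special handling for LightGBM: early stopping and logging are passed
        # as callbacks (the early_stopping_rounds and verbose arguments of
        # fit() were removed in LightGBM 4.0)
        if isinstance(model, lgb.LGBMRegressor):
            model.fit(
                X_train, y_train,
                eval_set=[(X_valid, y_valid)],
                eval_metric='rmse',
                callbacks=[
                    lgb.early_stopping(stopping_rounds=50),
                    lgb.log_evaluation(period=100)
                ]
            )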
        else:
            model.fit(X_train, y_train)
        
        # Make predictions on validation set
        valid_preds = model.predict(X_valid)
        
        # Store OOF predictions
        oof_preds[valid_idx] = valid_preds
        
        # Transform predictions back to original scale
        valid_preds_original = np.expm1(valid_preds)
        y_valid_original = np.expm1(y_valid)
        
        # Calculate SMAPE
        fold_smape = smape(y_valid_original, valid_preds_original)
        fold_scores.append(fold_smape)
        
        print(f"Fold {fold+1} SMAPE: {fold_smape:.4f}")
        
        # Store model
        models.append(model)
    
    # Calculate overall score
    mean_score = np.mean(fold_scores)
    print(f"Mean SMAPE across {len(folds)} folds: {mean_score:.4f}")
    
    # Save models
    if cache_dir:
        model_path = os.path.join(cache_dir, f"{model_name}_models.pkl")
        joblib.dump(models, model_path)
    
    return {
        'models': models,
        'oof_preds': oof_preds,
        'fold_scores': fold_scores,
        'mean_score': mean_score
    }


def generate_predictions(models, X_test):
    """Generate predictions using multiple fold models"""
    # Get predictions from each fold model
    test_preds_list = []
    for model in models:
        test_preds_fold = model.predict(X_test)
        test_preds_list.append(test_preds_fold)
    
    # Average predictions across folds
    test_preds = np.mean(test_preds_list, axis=0)
    
    return test_preds


def generate_submission(test_df, test_preds, output_path):
    """Generate submission file with predictions"""
    # Convert log predictions back to original scale
    test_preds_original = np.expm1(test_preds)
    
    # Clip to reasonable range (minimum 0.01)
    test_preds_clipped = np.maximum(test_preds_original, 0.01)
    
    # Create submission DataFrame
    submission = pd.DataFrame({
        'sample_id': test_df['sample_id'],
        'price': test_preds_clipped
    })
    
    # Save to CSV
    print(f"Saving submission to {output_path}")
    submission.to_csv(output_path, index=False)
    
    return submission


def main(args):
    """Main function for training and prediction"""
    print("Starting Smart Product Pricing pipeline...")
    start_time = time.time()
    
    # Set up paths
    paths = setup_paths(args.base_path, args.output_path, args.dataset_path)
    print(f"Using dataset path: {paths['dataset_path']}")
    print(f"Output path: {paths['output_path']}")
    
    # Load data
    print("\\nLoading data...")
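    if paths['train_path'] is None:
        # setup_paths() found no train.csv; pd.read_csv(None) would raise
        # ValueError, which the handler below does not catch
        print("Error: could not locate train.csv. Please pass the dataset path explicitly.")
        return 1
    try:
        train = pd.read_csv(paths['train_path'])
        test = pd.read_csv(paths['test_path'])
        print(f"Train data shape: {train.shape}")
        print(f"Test data shape: {test.shape}")
    except FileNotFoundError as e:
        print(f"Error loading data: {e}")
        print("Please ensure the dataset files are in the correct location.")
        return 1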
    
    # Process catalog content
    print("\\nProcessing catalog content...")
    train_processed = process_catalog_content(train)
    test_processed = process_catalog_content(test)
    
    # Encode categorical features
    print("\\nEncoding categorical features...")
    train_encoded, test_encoded, encoders = encode_categorical_features(
        train_processed, test_processed, categorical_cols=['brand']
    )
    
    # Generate TF-IDF SVD features
    print("\\nGenerating TF-IDF and SVD features...")
    train_tfidf_svd_df, test_tfidf_svd_df, tfidf_vectorizer, tfidf_svd = generate_tfidf_svd_features(
        train_encoded, test_encoded, text_col='all_text', cache_dir=paths['cache_dir']
    )
    
    # Prepare features for modeling
    print("\\nPreparing features for modeling...")
    train_features, test_features = prepare_features_for_modeling(
        train_encoded, test_encoded, 
        tfidf_svd_df_train=train_tfidf_svd_df, 
        tfidf_svd_df_test=test_tfidf_svd_df
    )
    
    # Prepare target variable (log-transformed price)
    print("\\nPreparing target variable...")
    train_encoded['log_price'] = np.log1p(train_encoded['price'])
    
    # Handle outliers in the target variable
    upper_threshold = np.percentile(train_encoded['log_price'], 99.9)
    train_encoded['log_price_capped'] = np.minimum(train_encoded['log_price'], upper_threshold)
    
    # Use capped version for training
    y = train_encoded['log_price_capped']
    
    # Set up cross-validation
    print("\\nSetting up cross-validation...")
    cv_folds = setup_cross_validation(train_features, y, n_splits=5, n_bins=10)
    
    # Define models to train
    if args.model == 'lightgbm':
        print("\\nTraining LightGBM model...")
        model_result = train_model_with_cv(
            train_features, y, 
            lgb.LGBMRegressor,
            {
                'n_estimators': 1000,
                'learning_rate': 0.05,
                'num_leaves': 31,
                'colsample_bytree': 0.8,
                'subsample': 0.8,
                'subsample_freq': 1,  # row subsampling is inactive while subsample_freq is 0
                'reg_alpha': 0.1,
                'reg_lambda': 0.1,
                'n_jobs': -1,
                'random_state': RANDOM_SEED
            },
            cv_folds,
            model_name='lightgbm',
            cache_dir=paths['cache_dir']
        )
    elif args.model == 'ridge':
        print("\\nTraining Ridge model...")
        model_result = train_model_with_cv(
            train_features, y,
            Ridge,
            {'alpha': 1.0, 'random_state': RANDOM_SEED},
            cv_folds,
            model_name='ridge',
            cache_dir=paths['cache_dir']
        )
    else:
        # Train both models
        print("\\nTraining Ridge model...")
        ridge_result = train_model_with_cv(
            train_features, y,
            Ridge,
            {'alpha': 1.0, 'random_state': RANDOM_SEED},
            cv_folds,
            model_name='ridge',
            cache_dir=paths['cache_dir']
        )
        
        print("\\nTraining LightGBM model...")
        lgb_result = train_model_with_cv(
            train_features, y, 
            lgb.LGBMRegressor,
            {
                'n_estimators': 1000,
                'learning_rate': 0.05,
                'num_leaves': 31,
                'colsample_bytree': 0.8,
                'subsample': 0.8,
                'subsample_freq': 1,  # LightGBM ignores 'subsample' unless this is > 0
                'reg_alpha': 0.1,
                'reg_lambda': 0.1,
                'n_jobs': -1,
                'random_state': RANDOM_SEED
            },
            cv_folds,
            model_name='lightgbm',
            cache_dir=paths['cache_dir']
        )
        
        # Use the model with the lower mean CV SMAPE (ties go to LightGBM)
        if ridge_result['mean_score'] < lgb_result['mean_score']:
            print("Ridge model performed better. Using Ridge for predictions.")
            model_result = ridge_result
        else:
            print("LightGBM model performed better. Using LightGBM for predictions.")
            model_result = lgb_result
    
    # Generate predictions
    print("\\nGenerating predictions...")
    test_preds = generate_predictions(model_result['models'], test_features)
    
    # Generate submission
    submission = generate_submission(test_encoded, test_preds, paths['output_csv_path'])
    
    # Save metrics
    metrics = {
        'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'model': {
            'name': args.model if args.model != 'both' else 'best_of_both',
            # Cast to built-in floats so NumPy scalars serialize cleanly to JSON
            'mean_smape': float(model_result['mean_score']),
            'fold_smapes': [float(s) for s in model_result['fold_scores']]
        },
        'runtime_seconds': time.time() - start_time
    }
    
    # Save to JSON
    with open(paths['metrics_path'], 'w') as f:
        json.dump(metrics, f, indent=2)
    
    print(f"\\nSaved metrics to {paths['metrics_path']}")
    
    # Print final performance
    print("\\n----- Final Performance -----")
    print(f"Mean SMAPE: {model_result['mean_score']:.4f}")
    print("Per-fold SMAPE:")
    for i, score in enumerate(model_result['fold_scores']):
        print(f"Fold {i+1}: {score:.4f}")
    
    # Print submission file path
    print(f"\\nSubmission file: {paths['output_csv_path']}")
    print(f"Runtime: {time.time() - start_time:.2f} seconds")
    
    return 0


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train and predict for Smart Product Pricing Challenge")
    parser.add_argument("--base-path", type=str, default=None, help="Base path for input files")
    parser.add_argument("--output-path", type=str, default=None, help="Path for output files")
    parser.add_argument("--dataset-path", type=str, default=None, help="Path to dataset directory")
    parser.add_argument("--model", type=str, default="both", choices=["ridge", "lightgbm", "both"], 
                        help="Model to train (ridge, lightgbm, or both)")
    
    args = parser.parse_args()
    sys.exit(main(args))
'''
    
    print(f"Writing train_predict.py to {output_path}")
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(script_content)
    
    print(f"Script generated successfully: {output_path}")
    
    # Warn on an unexpected extension; otherwise mark the script as executable
    if not str(output_path).endswith('.py'):
        print("Warning: Script filename should end with .py")
    else:
        try:
            os.chmod(output_path, 0o755)  # rwxr-xr-x
            print("Script is now executable")
        except OSError as e:
            print(f"Could not make script executable: {e}")
    
    print("\nTo run the script:")
    print(f"python {output_path} --output-path /path/to/output")
    
    return output_path

# Generate README
def generate_readme(output_path=None):
    """Generate README.md with instructions for running the pipeline"""
    if output_path is None:
        output_path = os.path.join(OUTPUT_PATH, 'README.md')
    
    readme_content = '''# Smart Product Pricing Challenge

This repository contains an end-to-end machine learning pipeline for predicting product prices based on catalog content and image features.

## Overview

The pipeline uses a combination of text processing, feature engineering, and machine learning models to predict product prices. The approach includes:

1. Text feature extraction from catalog content
2. Image feature extraction using pretrained models (optional)
3. Model training with cross-validation
4. Ensemble methods for improved performance

## Requirements

- Python 3.8+
- Required packages: numpy, pandas, scikit-learn, lightgbm, tqdm, joblib
- Optional packages: torch, torchvision, sentence_transformers
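
To see which of these packages are importable before running the pipeline, a minimal sketch follows. The helper name `missing_packages` is illustrative and not part of the pipeline; note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Import names for the packages listed above (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "pandas", "sklearn", "lightgbm", "tqdm", "joblib"]
OPTIONAL = ["torch", "torchvision", "sentence_transformers"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing required packages:", missing_packages(REQUIRED))
print("Missing optional packages:", missing_packages(OPTIONAL))
```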

## How to Run

### Jupyter Notebook

The main pipeline is available in `product_price_pipeline.ipynb`. To run it:

1. Open the notebook in Jupyter or VS Code
2. Run all cells sequentially
3. The final submission will be saved as `test_out.csv`

### Command-Line Script

Alternatively, use the provided command-line script:

```bash
python train_predict.py --output-path /path/to/output
```

Options:
- `--base-path`: Base path for input files (default: auto-detect)
- `--output-path`: Path for output files (default: current directory or /kaggle/working)
- `--dataset-path`: Path to dataset directory (default: auto-detect)
- `--model`: Model to use - "ridge", "lightgbm", or "both" (default: "both")

## Quick Baseline

For a fast baseline (< 10 minutes runtime), run the "Quick Baseline" section in the notebook.

## Model Performance

The pipeline uses multiple models and ensemble methods to achieve the best performance:

1. Ridge Regression on TF-IDF SVD features
2. LightGBM on combined features
3. Stacking ensemble for final predictions

Typical SMAPE performance on cross-validation: ~8-10%
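
For reference, SMAPE can be computed as in the sketch below. It assumes the common definition, with the mean of the absolute true and predicted values in the denominator and the result scaled to percent; the competition's official formula and the notebook's own metric function may differ in scaling or zero handling.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Treat a term as 0 when both values are 0, which avoids dividing 0 by 0.
    ratio = np.divide(
        np.abs(y_pred - y_true),
        denom,
        out=np.zeros_like(denom),
        where=denom > 0,
    )
    return 100.0 * ratio.mean()
```

Predictions made in log space should be converted back to prices (for example with `np.expm1`) before scoring.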

## Methodology

See `Documentation_template.md` for a detailed explanation of the methodology.
'''
    
    print(f"Writing README.md to {output_path}")
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(readme_content)
    
    print(f"README generated successfully: {output_path}")
    
    return output_path

# Generate files
train_predict_path = generate_train_predict_script()
readme_path = generate_readme()

print("\nUtility scripts generated:")
print(f"1. Train/Predict script: {train_predict_path}")
print(f"2. README file: {readme_path}")
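
Because `train_predict.py` is written out from a string template, quoting or escaping mistakes in the template only surface when the script is run. A quick syntax check on the generated file can catch them earlier. The sketch below is one way to do that; `check_script_syntax` is a hypothetical helper, not part of the notebook, and it only verifies that the file parses, not that it runs correctly.

```python
import ast
from pathlib import Path


def check_script_syntax(path):
    """Return None if the file parses as Python, else a short error message."""
    source = Path(path).read_text(encoding="utf-8")
    try:
        ast.parse(source, filename=str(path))
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None
```

Calling `check_script_syntax(train_predict_path)` after the generation cell returns `None` when the script parses, or a message with the offending line number otherwise.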