# Multi-Segment Ensemble Model Training Pipeline

This notebook trains a Ridge regression meta-model ensemble (base models: Prophet with weather regressors, Holt-Winters, and univariate Prophet) for the most popular branch, tonnage, and star-rating combinations.

## Overview
- Train ensemble models for the top 10 segment combinations by transaction count
- Use 2022-2024 as training data and 2025 (January to June, the months available) as test data
- Save all trained models to disk with proper naming
- Log performance metrics and model file paths

## Target Combinations (Top 10 by transaction count)
1. CHENNAI, 1.5T, 3 Star (15,744 transactions)
2. CHENNAI, 1.0T, 3 Star (8,253 transactions)
3. VIJAYAWADA, 1.5T, 5 Star (7,083 transactions)
4. COCHIN, 1.0T, 3 Star (6,643 transactions)
5. COCHIN, 1.5T, 3 Star (6,063 transactions)
6. VIJAYAWADA, 1.5T, 3 Star (5,847 transactions)
7. HYDERABAD, 1.5T, 3 Star (5,806 transactions)
8. CHENNAI, 1.5T, 5 Star (5,542 transactions)
9. HYDERABAD, 1.5T, 5 Star (4,636 transactions)
10. BANGALORE, 1.5T, 3 Star (3,959 transactions)
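
A ranking like the one above can be reproduced with a group-by on the three segment columns. The sketch below is only illustrative: it assumes one row per transaction and the column names `Branch`, `Tonnage`, and `Star rating` used later in this notebook, it runs on a small made-up frame rather than the real sales file, and `top_segments` is a hypothetical helper.

```python
import pandas as pd

def top_segments(df, n=10):
    """Return the n most frequent (Branch, Tonnage, Star rating) combinations."""
    return (
        df.groupby(['Branch', 'Tonnage', 'Star rating'])
          .size()                                  # rows per combination
          .reset_index(name='transactions')
          .sort_values('transactions', ascending=False)
          .head(n)
          .reset_index(drop=True)
    )

# Made-up rows, one per transaction
toy = pd.DataFrame({
    'Branch': ['CHENNAI', 'CHENNAI', 'CHENNAI', 'COCHIN', 'COCHIN', 'HYDERABAD'],
    'Tonnage': [1.5, 1.5, 1.5, 1.0, 1.0, 1.5],
    'Star rating': ['3 Star', '3 Star', '3 Star', '3 Star', '3 Star', '5 Star'],
})

ranked = top_segments(toy, n=2)
print(ranked)
```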


In [1]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
import pickle
import os
import time
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

# Time series models
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from prophet import Prophet

warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
RANDOM_STATE = 42

print("All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


All libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.4


In [2]:
# Helper functions for model saving and loading
def save_model(model, filepath):
    """Save a trained model to disk as pickle file"""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'wb') as f:
        pickle.dump(model, f)
    return filepath

def load_model(filepath):
    """Load a trained model from disk"""
    with open(filepath, 'rb') as f:
        model = pickle.load(f)
    return model

def load_saved_models(segment_name):
    """Load all models for a specific segment"""
    models = {}
    try:
        models['prophet_weather'] = load_model(
            f'outputs/models/multi_segment/{segment_name}_prophet_weather.pkl')
        models['holtwinters'] = load_model(
            f'outputs/models/multi_segment/{segment_name}_holtwinters.pkl')
        models['prophet_univariate'] = load_model(
            f'outputs/models/multi_segment/{segment_name}_prophet_univariate.pkl')
        models['ridge_meta'] = load_model(
            f'outputs/models/multi_segment/{segment_name}_ridge_meta.pkl')
        print(f"Successfully loaded all models for segment: {segment_name}")
    except FileNotFoundError as e:
        print(f"Error loading models for {segment_name}: {e}")
    return models

def calculate_metrics(y_true, y_pred):
    """Calculate MAE, RMSE, and MAPE"""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    return {'MAE': mae, 'RMSE': rmse, 'MAPE': mape}

def create_segment_name(branch, tonnage, star_rating):
    """Create standardized segment name for file naming"""
    branch_clean = branch.lower().replace(' ', '_')
    tonnage_clean = str(tonnage).replace('.', '_')
    star_clean = star_rating.lower().replace(' ', '')
    return f"{branch_clean}_{tonnage_clean}_{star_clean}"

print("Helper functions defined successfully!")


Helper functions defined successfully!
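
To make the naming convention concrete, the sketch below repeats `create_segment_name` so that it runs on its own, then round-trips a stand-in object through pickle in a temporary directory. The dictionary standing in for a model and the temporary path are assumptions for illustration only.

```python
import os
import pickle
import tempfile

def create_segment_name(branch, tonnage, star_rating):
    """Copy of the notebook helper, repeated so this block runs on its own."""
    branch_clean = branch.lower().replace(' ', '_')
    tonnage_clean = str(tonnage).replace('.', '_')
    star_clean = star_rating.lower().replace(' ', '')
    return f"{branch_clean}_{tonnage_clean}_{star_clean}"

name = create_segment_name('CHENNAI', 1.5, '3 Star')
print(name)  # chennai_1_5_3star

# Round-trip a stand-in "model" (a plain dict, purely illustrative)
stand_in_model = {'alpha': 1.0, 'coef': [0.4, 0.6]}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'models', f'{name}_ridge_meta.pkl')
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(stand_in_model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```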


In [3]:
# Define target segments based on transaction count analysis
target_segments = [
    {'branch': 'CHENNAI', 'tonnage': 1.5, 'star_rating': '3 Star', 'transactions': 15744},
    {'branch': 'CHENNAI', 'tonnage': 1.0, 'star_rating': '3 Star', 'transactions': 8253},
    {'branch': 'VIJAYAWADA', 'tonnage': 1.5, 'star_rating': '5 Star', 'transactions': 7083},
    {'branch': 'COCHIN', 'tonnage': 1.0, 'star_rating': '3 Star', 'transactions': 6643},
    {'branch': 'COCHIN', 'tonnage': 1.5, 'star_rating': '3 Star', 'transactions': 6063},
    {'branch': 'VIJAYAWADA', 'tonnage': 1.5, 'star_rating': '3 Star', 'transactions': 5847},
    {'branch': 'HYDERABAD', 'tonnage': 1.5, 'star_rating': '3 Star', 'transactions': 5806},
    {'branch': 'CHENNAI', 'tonnage': 1.5, 'star_rating': '5 Star', 'transactions': 5542},
    {'branch': 'HYDERABAD', 'tonnage': 1.5, 'star_rating': '5 Star', 'transactions': 4636},
    {'branch': 'BANGALORE', 'tonnage': 1.5, 'star_rating': '3 Star', 'transactions': 3959}
]

print(f"Target segments defined: {len(target_segments)} combinations")
print("\nSegment details:")
for i, segment in enumerate(target_segments, 1):
    print(f"{i:2d}. {segment['branch']} - {segment['tonnage']}T - {segment['star_rating']} ({segment['transactions']:,} transactions)")


Target segments defined: 10 combinations

Segment details:
 1. CHENNAI - 1.5T - 3 Star (15,744 transactions)
 2. CHENNAI - 1.0T - 3 Star (8,253 transactions)
 3. VIJAYAWADA - 1.5T - 5 Star (7,083 transactions)
 4. COCHIN - 1.0T - 3 Star (6,643 transactions)
 5. COCHIN - 1.5T - 3 Star (6,063 transactions)
 6. VIJAYAWADA - 1.5T - 3 Star (5,847 transactions)
 7. HYDERABAD - 1.5T - 3 Star (5,806 transactions)
 8. CHENNAI - 1.5T - 5 Star (5,542 transactions)
 9. HYDERABAD - 1.5T - 5 Star (4,636 transactions)
10. BANGALORE - 1.5T - 3 Star (3,959 transactions)


In [4]:
# Load and prepare the main dataset
print("Loading and preparing main dataset...")

# Load the cleaned data (this should match the processed data from notebook.ipynb)
df = pd.read_csv('data/Final Sales.csv')

# Handle missing values
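# Assign the result back; calling fillna(inplace=True) on a column selection
# triggers chained-assignment warnings in pandas 2.x
df['Star rating'] = df['Star rating'].fillna('3 Star')
df['Tonnage'] = df['Tonnage'].fillna(1.5)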

# Convert Date column to datetime and create time-based features
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.day_name()
df['MonthName'] = df['Date'].dt.month_name()

# Remove outliers using IQR method (as done in notebook.ipynb)
sales_qty = df['Sales Qty.']
Q1 = sales_qty.quantile(0.25)
Q3 = sales_qty.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_cleaned = df[(sales_qty >= lower_bound) & (sales_qty <= upper_bound)].copy()
print(f"Original data: {len(df):,} records")
print(f"After outlier removal: {len(df_cleaned):,} records")
print(f"Outliers removed: {len(df) - len(df_cleaned):,} records ({(len(df) - len(df_cleaned))/len(df)*100:.1f}%)")

# Load weather data
weather_dir = 'outputs/processed_weather_data/'
state_mapping = {
    'AP_weather_timeseries.csv': 'Andhra Pradesh',
    'KA_weather_timeseries.csv': 'Karnataka', 
    'KL_weather_timeseries.csv': 'Kerala',
    'TN_weather_timeseries.csv': 'Tamil Nadu',
    'TL_weather_timeseries.csv': 'Telangana'
}

weather_data_list = []
for filename, state in state_mapping.items():
    filepath = os.path.join(weather_dir, filename)
    if os.path.exists(filepath):
        weather_df = pd.read_csv(filepath)
        weather_df['State'] = state
        weather_data_list.append(weather_df)

if weather_data_list:
    all_weather_data = pd.concat(weather_data_list, ignore_index=True)
    all_weather_data['Date'] = pd.to_datetime(all_weather_data['Date'])
    all_weather_data['Year'] = all_weather_data['Date'].dt.year
    all_weather_data['Month'] = all_weather_data['Date'].dt.month
    all_weather_data['merge_key'] = all_weather_data['Year'].astype(str) + '-' + \
                                   all_weather_data['Month'].astype(str).str.zfill(2) + '-' + \
                                   all_weather_data['State']
    weather_for_merge = all_weather_data.drop('Date', axis=1)
    print(f"Weather data loaded: {len(weather_for_merge)} records")
else:
    print("Warning: No weather data found!")
    weather_for_merge = pd.DataFrame()

# Create population data
population_data = {
    'CHENNAI': [11.23, 11.5, 11.77, 12.05, 12.33],
    'COCHIN': [3.19, 3.30, 3.40, 3.50, 3.60],
    'HYDERABAD': [10.26, 10.53, 10.80, 11.10, 11.38],
    'VIJAYAWADA': [2.10, 2.16, 2.23, 2.29, 2.35],
    'BANGALORE': [12.76, 13.19, 13.60, 14.00, 14.39]
}

population_list = []
years = [2021, 2022, 2023, 2024, 2025]
for branch, pop_values in population_data.items():
    for i, year in enumerate(years):
        population_list.append({
            'Branch': branch,
            'Year': year,
            'Population_Millions': pop_values[i]
        })

population_df = pd.DataFrame(population_list)
print(f"Population data created: {len(population_df)} records")

print("\nData preparation completed!")


Loading and preparing main dataset...
Original data: 118,465 records
After outlier removal: 104,514 records
Outliers removed: 13,951 records (11.8%)
Weather data loaded: 300 records
Population data created: 25 records

Data preparation completed!
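
The IQR rule used in the cell above keeps values within 1.5 interquartile ranges of the quartiles. A minimal sketch on made-up quantities (not the sales data) shows which values fall outside the bounds; `iqr_bounds` is a hypothetical helper name.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds at k interquartile ranges from Q1 and Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up quantities with one obvious outlier
qty = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50])
lower, upper = iqr_bounds(qty)
kept = qty[(qty >= lower) & (qty <= upper)]

print(f"bounds: [{lower}, {upper}]")
print(f"kept {len(kept)} of {len(qty)} values")
```

Because the bounds here are computed over all rows at once, a high-volume segment and a low-volume segment share the same thresholds; whether that is appropriate depends on the data.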


In [5]:
# Merge weather and population data with main dataset
print("Merging weather and population data...")

# Create merge key for main dataframe
df_cleaned['merge_key'] = df_cleaned['Year'].astype(str) + '-' + \
                        df_cleaned['Month'].astype(str).str.zfill(2) + '-' + \
                        df_cleaned['State']

# Merge weather data
if not weather_for_merge.empty:
    df_with_weather = df_cleaned.merge(
        weather_for_merge, 
        on='merge_key', 
        how='left',
        suffixes=('', '_weather')
    )
    print(f"Weather data merged successfully")
else:
    df_with_weather = df_cleaned.copy()
    print("No weather data to merge")

# Merge population data
df_final = df_with_weather.merge(
    population_df, 
    on=['Branch', 'Year'], 
    how='left'
)

# Clean up temporary columns
df_final = df_final.drop('merge_key', axis=1, errors='ignore')
if 'State_weather' in df_final.columns:
    df_final = df_final.drop('State_weather', axis=1)
if 'Year_weather' in df_final.columns:
    df_final = df_final.drop('Year_weather', axis=1)
if 'Month_weather' in df_final.columns:
    df_final = df_final.drop('Month_weather', axis=1)

# Create additional features
df_final['Sales_Per_Capita'] = df_final['Sales Qty.'] / df_final['Population_Millions']

# Create season mapping
season_map = {12: 'Winter', 1: 'Winter', 2: 'Winter',
              3: 'Spring', 4: 'Spring', 5: 'Spring',
              6: 'Summer', 7: 'Summer', 8: 'Summer',
              9: 'Fall', 10: 'Fall', 11: 'Fall'}
df_final['Season'] = df_final['Month'].map(season_map)

# Encode season
le_season = LabelEncoder()
df_final['Season_encoded'] = le_season.fit_transform(df_final['Season'])

print(f"Final dataset shape: {df_final.shape}")
print(f"Columns: {list(df_final.columns)}")
print(f"Date range: {df_final['Date'].min()} to {df_final['Date'].max()}")
print(f"Years: {sorted(df_final['Year'].unique())}")

# Check for missing values in key columns
key_cols = ['Sales Qty.', 'Population_Millions', 'Sales_Per_Capita', 'Season_encoded']
if 'Max_Temp' in df_final.columns:
    key_cols.extend(['Max_Temp', 'Min_Temp', 'Humidity', 'Wind_Speed'])

print("\nMissing values in key columns:")
for col in key_cols:
    if col in df_final.columns:
        missing = df_final[col].isnull().sum()
        missing_pct = (missing / len(df_final)) * 100
        print(f"  {col}: {missing:,} ({missing_pct:.2f}%)")

print("\nData preparation and merging completed!")


Merging weather and population data...
Weather data merged successfully
Final dataset shape: (104514, 23)
Columns: ['Date', 'Branch', 'Inv. No.', 'Item Code', 'Sales Qty.', 'State', 'Status', 'Star rating', 'Tonnage', 'Technology', 'Year', 'Month', 'Day', 'DayOfWeek', 'MonthName', 'Max_Temp', 'Min_Temp', 'Humidity', 'Wind_Speed', 'Population_Millions', 'Sales_Per_Capita', 'Season', 'Season_encoded']
Date range: 2021-11-25 00:00:00 to 2025-06-30 00:00:00
Years: [np.int32(2021), np.int32(2022), np.int32(2023), np.int32(2024), np.int32(2025)]

Missing values in key columns:
  Sales Qty.: 0 (0.00%)
  Population_Millions: 0 (0.00%)
  Sales_Per_Capita: 0 (0.00%)
  Season_encoded: 0 (0.00%)
  Max_Temp: 0 (0.00%)
  Min_Temp: 0 (0.00%)
  Humidity: 0 (0.00%)
  Wind_Speed: 0 (0.00%)

Data preparation and merging completed!
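
One way to double-check a key-based merge like the one above is to let pandas validate the key cardinality and flag unmatched rows. This is a sketch on made-up frames, assuming the same `Year`-`Month`-`State` key format; `make_key` is a hypothetical helper.

```python
import pandas as pd

def make_key(df):
    """Build a 'YYYY-MM-State' key, matching the format used in this notebook."""
    return (df['Year'].astype(str) + '-'
            + df['Month'].astype(str).str.zfill(2) + '-'
            + df['State'])

# Made-up sales rows and monthly weather rows
sales = pd.DataFrame({
    'Year': [2024, 2024, 2025],
    'Month': [1, 1, 2],
    'State': ['Kerala', 'Tamil Nadu', 'Kerala'],
    'Sales Qty.': [3, 5, 2],
})
weather = pd.DataFrame({
    'Year': [2024, 2024],
    'Month': [1, 1],
    'State': ['Kerala', 'Tamil Nadu'],
    'Max_Temp': [31.0, 29.5],
})
sales['merge_key'] = make_key(sales)
weather['merge_key'] = make_key(weather)

merged = sales.merge(
    weather[['merge_key', 'Max_Temp']],
    on='merge_key',
    how='left',
    validate='many_to_one',   # raises if a weather key appears more than once
    indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['merge_key', 'Sales Qty.']])
```

With `validate='many_to_one'`, duplicate weather rows for the same month and state raise an error instead of silently multiplying sales rows.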


In [None]:
# Data preparation functions for each segment
def prepare_segment_data(segment_config, df_final):
    """
    Prepare train/test data for a specific segment combination
    
    Args:
        segment_config: Dictionary with branch, tonnage, star_rating
        df_final: Main dataframe with all features
    
    Returns:
        train_df, test_df: Monthly aggregated data for training and testing
    """
    branch = segment_config['branch']
    tonnage = segment_config['tonnage']
    star_rating = segment_config['star_rating']
    
    print(f"  Preparing data for {branch} - {tonnage}T - {star_rating}")
    
    # Filter data for this segment
    segment_data = df_final[
        (df_final['Branch'] == branch) & 
        (df_final['Tonnage'] == tonnage) & 
        (df_final['Star rating'] == star_rating)
    ].copy()
    
    if len(segment_data) == 0:
        print(f"    Warning: No data found for {branch} - {tonnage}T - {star_rating}")
        return None, None
    
    # Create YearMonth column for aggregation - FIXED DATE FORMATTING
    segment_data['YearMonth'] = pd.to_datetime(
        segment_data['Year'].astype(str) + '-' + 
        segment_data['Month'].astype(str).str.zfill(2) + '-01'
    )
    
    # Aggregate to monthly level
    monthly_data = segment_data.groupby('YearMonth').agg({
        'Sales Qty.': 'sum',
        'Max_Temp': 'mean',
        'Min_Temp': 'mean', 
        'Humidity': 'mean',
        'Wind_Speed': 'mean',
        'Population_Millions': 'mean',
        'Sales_Per_Capita': 'mean',
        'Season_encoded': lambda x: x.mode()[0] if not x.mode().empty else 0
    }).reset_index()
    
    # Split into train (2022-2024) and test (2025)
    train_data = monthly_data[monthly_data['YearMonth'].dt.year.isin([2022, 2023, 2024])].copy()
    test_data = monthly_data[monthly_data['YearMonth'].dt.year == 2025].copy()
    
    print(f"    Train samples: {len(train_data)} (2022-2024)")
    print(f"    Test samples: {len(test_data)} (2025)")
    
    if len(train_data) == 0:
        print(f"    Warning: No training data for {branch} - {tonnage}T - {star_rating}")
        return None, None
    
    if len(test_data) == 0:
        print(f"    Warning: No test data for {branch} - {tonnage}T - {star_rating}")
        return None, None
    
    return train_data, test_data

def save_segment_data(train_df, test_df, segment_name):
    """Save train/test data to CSV files"""
    train_file = f'outputs/data/{segment_name}_train.csv'
    test_file = f'outputs/data/{segment_name}_test.csv'
    
    os.makedirs(os.path.dirname(train_file), exist_ok=True)
    
    train_df.to_csv(train_file, index=False)
    test_df.to_csv(test_file, index=False)
    
    return train_file, test_file

print("Data preparation functions defined successfully!")


Data preparation functions defined successfully!
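
The monthly aggregation and year-based split can be illustrated on a handful of made-up rows. The sketch below derives month starts with `dt.to_period('M')` instead of string concatenation (an alternative, not the notebook's approach) and also checks for missing months, since a seasonal model with a 12-month period assumes a contiguous series.

```python
import pandas as pd

# Made-up transactions spanning the train/test boundary
rows = pd.DataFrame({
    'Date': pd.to_datetime(['2024-11-03', '2024-11-20', '2024-12-05',
                            '2025-01-10', '2025-01-25']),
    'Sales Qty.': [2, 3, 4, 1, 5],
})

# Month-start timestamp for each row
rows['YearMonth'] = rows['Date'].dt.to_period('M').dt.to_timestamp()

# Aggregate to one row per month
monthly = rows.groupby('YearMonth', as_index=False)['Sales Qty.'].sum()

# Year-based split, as in prepare_segment_data
train = monthly[monthly['YearMonth'].dt.year.isin([2022, 2023, 2024])]
test = monthly[monthly['YearMonth'].dt.year == 2025]

# Months that should exist between first and last observation but do not
expected = pd.date_range(monthly['YearMonth'].min(),
                         monthly['YearMonth'].max(), freq='MS')
missing = expected.difference(pd.DatetimeIndex(monthly['YearMonth']))

print(monthly)
print(f"train months: {len(train)}, test months: {len(test)}, missing: {len(missing)}")
```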


In [None]:
# Base model training functions
def train_prophet_weather_model(y_train, X_train_weather, y_test, X_test_weather, segment_name):
    """Train Prophet model with weather features - EXACT IMPLEMENTATION FROM ts.ipynb"""
    print(f"    Training Prophet (Weather) model...")
    
    try:
        # Prepare data for Prophet with weather regressors - EXACT FROM ts.ipynb
        prophet_weather_train = pd.DataFrame({
            'ds': y_train.index,
            'y': y_train.values
        })
        
        # Add weather regressors
        weather_cols = ['Max_Temp', 'Min_Temp', 'Humidity', 'Wind_Speed']
        for col in weather_cols:
            if col in X_train_weather.columns:
                prophet_weather_train[col] = X_train_weather[col].values
        
        # Fit Prophet model with weather regressors - EXACT FROM ts.ipynb
        prophet_weather_model = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
        
        # Add weather regressors
        for col in weather_cols:
            if col in X_train_weather.columns:
                prophet_weather_model.add_regressor(col)
        
        prophet_weather_model.fit(prophet_weather_train)
        
        # Create future dataframe with weather regressors - EXACT FROM ts.ipynb
        future_weather = prophet_weather_model.make_future_dataframe(periods=len(y_test), freq='MS')
        future_weather = future_weather.tail(len(y_test))  # Only keep the forecast period
        
        # Add weather regressor values for future period
        for col in weather_cols:
            if col in X_test_weather.columns:
                future_weather[col] = X_test_weather[col].values
        
        # Make predictions
        prophet_weather_forecast = prophet_weather_model.predict(future_weather)
        prophet_weather_pred = prophet_weather_forecast['yhat'].values
        
        # Save model
        model_file = save_model(prophet_weather_model, 
            f'outputs/models/multi_segment/{segment_name}_prophet_weather.pkl')
        
        print(f"      Prophet (Weather) model saved: {model_file}")
        return prophet_weather_pred, model_file
        
    except Exception as e:
        print(f"      Error training Prophet (Weather): {e}")
        return None, None

def train_holtwinters_model(y_train, y_test, segment_name):
    """Train Holt-Winters exponential smoothing model - EXACT IMPLEMENTATION FROM ts.ipynb"""
    print(f"    Training Holt-Winters model...")
    
    try:
        # Fit Holt-Winters model with additive seasonality - EXACT FROM ts.ipynb
        hw_model = ExponentialSmoothing(
            y_train, 
            trend='add', 
            seasonal='add', 
            seasonal_periods=12
        ).fit(optimized=True)
        
        # Make predictions
        hw_pred = hw_model.forecast(steps=len(y_test))
        
        # Save model
        model_file = save_model(hw_model, 
            f'outputs/models/multi_segment/{segment_name}_holtwinters.pkl')
        
        print(f"      Holt-Winters model saved: {model_file}")
        return hw_pred, model_file
        
    except Exception as e:
        print(f"      Error training Holt-Winters: {e}")
        return None, None

def train_prophet_univariate_model(y_train, y_test, segment_name):
    """Train Prophet model on sales data only - EXACT IMPLEMENTATION FROM ts.ipynb"""
    print(f"    Training Prophet (Univariate) model...")
    
    try:
        # Prepare data for Prophet - EXACT FROM ts.ipynb
        prophet_train = pd.DataFrame({
            'ds': y_train.index,
            'y': y_train.values
        })
        
        # Fit Prophet model
        prophet_model = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
        prophet_model.fit(prophet_train)
        
        # Create future dataframe - EXACT FROM ts.ipynb
        future = prophet_model.make_future_dataframe(periods=len(y_test), freq='MS')
        future = future.tail(len(y_test))  # Only keep the forecast period
        
        # Make predictions
        prophet_forecast = prophet_model.predict(future)
        prophet_pred = prophet_forecast['yhat'].values
        
        # Save model
        model_file = save_model(prophet_model, 
            f'outputs/models/multi_segment/{segment_name}_prophet_univariate.pkl')
        
        print(f"      Prophet (Univariate) model saved: {model_file}")
        return prophet_pred, model_file
        
    except Exception as e:
        print(f"      Error training Prophet (Univariate): {e}")
        return None, None

def train_base_models(train_df, test_df, segment_name):
    """Train all three base models and return predictions"""
    print(f"  Training base models for {segment_name}...")
    
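    # Extract target and features. Index the targets by YearMonth so that
    # Prophet's 'ds' column receives real dates; with the default integer index
    # Prophet fails with "day is out of range for month: 0" (see training log).
    y_train = train_df.set_index('YearMonth')['Sales Qty.']
    y_test = test_df.set_index('YearMonth')['Sales Qty.']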
    
    # Weather features
    weather_cols = ['Max_Temp', 'Min_Temp', 'Humidity', 'Wind_Speed']
    X_train_weather = train_df[weather_cols] if all(col in train_df.columns for col in weather_cols) else pd.DataFrame()
    X_test_weather = test_df[weather_cols] if all(col in test_df.columns for col in weather_cols) else pd.DataFrame()
    
    base_predictions = []
    model_files = []
    
    # Train Prophet (Weather)
    if not X_train_weather.empty:
        prophet_weather_pred, prophet_weather_file = train_prophet_weather_model(
            y_train, X_train_weather, y_test, X_test_weather, segment_name)
        if prophet_weather_pred is not None:
            base_predictions.append(prophet_weather_pred)
            model_files.append(prophet_weather_file)
    
    # Train Holt-Winters
    holtwinters_pred, holtwinters_file = train_holtwinters_model(y_train, y_test, segment_name)
    if holtwinters_pred is not None:
        base_predictions.append(holtwinters_pred)
        model_files.append(holtwinters_file)
    
    # Train Prophet (Univariate)
    prophet_univariate_pred, prophet_univariate_file = train_prophet_univariate_model(y_train, y_test, segment_name)
    if prophet_univariate_pred is not None:
        base_predictions.append(prophet_univariate_pred)
        model_files.append(prophet_univariate_file)
    
    print(f"  Base models training completed. {len(base_predictions)} models trained successfully.")
    return base_predictions, model_files

print("Base model training functions defined successfully!")


Base model training functions defined successfully!
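
The Prophet functions above build `ds` from `y_train.index`, so the index of the series passed in matters. The `day is out of range for month: 0` errors in the training log further down are consistent with `ds` receiving integer row positions rather than dates. A pandas-only sketch on made-up values shows the difference; `to_prophet_frame` is a hypothetical helper mirroring the frame construction used above.

```python
import pandas as pd

# Made-up monthly data in the shape produced by the aggregation step
monthly = pd.DataFrame({
    'YearMonth': pd.date_range('2024-10-01', periods=3, freq='MS'),
    'Sales Qty.': [120, 95, 80],
})

def to_prophet_frame(y):
    """Mirror the notebook's construction: 'ds' from the index, 'y' from values."""
    return pd.DataFrame({'ds': y.index, 'y': y.values})

# Default integer index: 'ds' holds row positions, not dates
y_positional = monthly['Sales Qty.']
# Indexed by YearMonth: 'ds' holds month-start timestamps
y_dated = monthly.set_index('YearMonth')['Sales Qty.']

frame_positional = to_prophet_frame(y_positional)
frame_dated = to_prophet_frame(y_dated)

print(frame_positional['ds'].tolist())
print(frame_dated['ds'].tolist())
```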


In [None]:
# Ridge meta-model training function
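def train_ridge_meta_model(base_predictions, y_test, segment_name):
    """Train Ridge regression meta-model using base model predictions - EXACT IMPLEMENTATION FROM ts.ipynb

    Note: the meta-model is fit on the test-period base predictions and y_test,
    so metrics computed from its output on the same data are in-sample rather
    than a true hold-out estimate.
    """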
    print(f"  Training Ridge meta-model for {segment_name}...")
    
    try:
        if len(base_predictions) < 2:
            print(f"    Warning: Need at least 2 base models for ensemble. Only {len(base_predictions)} available.")
            return None, None, None
        
        # Create prediction matrix
        predictions_matrix = np.array(base_predictions).T
        
        # Try different alpha values for Ridge regression - EXACT FROM ts.ipynb
        alphas = [0.1, 1.0, 10.0, 100.0]
        ridge_model = Ridge()
        grid_search = GridSearchCV(ridge_model, {'alpha': alphas}, cv=3, scoring='neg_mean_absolute_error')
        grid_search.fit(predictions_matrix, y_test.values)
        
        best_ridge = grid_search.best_estimator_
        ridge_pred = best_ridge.predict(predictions_matrix)
        
        # Save model
        model_file = save_model(best_ridge, 
            f'outputs/models/multi_segment/{segment_name}_ridge_meta.pkl')
        
        print(f"    Ridge meta-model saved: {model_file}")
        print(f"    Best alpha: {grid_search.best_params_['alpha']}")
        
        return ridge_pred, model_file, grid_search.best_params_['alpha']
        
    except Exception as e:
        print(f"    Error training Ridge meta-model: {e}")
        return None, None, None

print("Ridge meta-model training function defined successfully!")


Ridge meta-model training function defined successfully!
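
The function above fits the Ridge meta-model on the same test-period predictions it is later scored on. A common alternative, sketched here on made-up numbers and not taken from ts.ipynb, is to fit the meta-model on base predictions for a separate validation window and only predict on the test window. The window lengths, noise levels, and `alpha=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Made-up targets: 12 validation months and 6 test months
y_val = rng.uniform(50, 150, size=12)
y_test = rng.uniform(50, 150, size=6)

# Made-up predictions from two base models (target plus noise)
base_val = np.column_stack([y_val + rng.normal(0, 5, 12),
                            y_val + rng.normal(0, 10, 12)])
base_test = np.column_stack([y_test + rng.normal(0, 5, 6),
                             y_test + rng.normal(0, 10, 6)])

# Fit the meta-model on validation-period predictions only
meta = Ridge(alpha=1.0)
meta.fit(base_val, y_val)

# The test period is used for prediction and scoring, never for fitting
ensemble_test = meta.predict(base_test)
mae = np.mean(np.abs(y_test - ensemble_test))

print(f"meta-model weights: {meta.coef_}")
print(f"hold-out MAE: {mae:.2f}")
```

Applying this to the pipeline would require base-model forecasts for a validation period (for example, the last year of the training range), which the current functions do not produce.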


In [9]:
# Complete ensemble training function
def train_segment_ensemble(segment_config, df_final):
    """Train complete ensemble for a single segment"""
    branch = segment_config['branch']
    tonnage = segment_config['tonnage']
    star_rating = segment_config['star_rating']
    segment_name = create_segment_name(branch, tonnage, star_rating)
    
    print(f"\n{'='*60}")
    print(f"Training ensemble for: {branch} - {tonnage}T - {star_rating}")
    print(f"Segment name: {segment_name}")
    print(f"{'='*60}")
    
    start_time = time.time()
    
    try:
        # Prepare data
        train_df, test_df = prepare_segment_data(segment_config, df_final)
        
        if train_df is None or test_df is None:
            print(f"  Skipping {segment_name} due to insufficient data")
            return None
        
        # Save segment data
        train_file, test_file = save_segment_data(train_df, test_df, segment_name)
        print(f"  Data saved: {train_file}, {test_file}")
        
        # Train base models
        base_predictions, base_model_files = train_base_models(train_df, test_df, segment_name)
        
        if len(base_predictions) == 0:
            print(f"  Skipping {segment_name} - no base models trained successfully")
            return None
        
        # Train Ridge meta-model
        y_test = test_df['Sales Qty.']
        ensemble_pred, ridge_file, best_alpha = train_ridge_meta_model(base_predictions, y_test, segment_name)
        
        if ensemble_pred is None:
            print(f"  Skipping {segment_name} - meta-model training failed")
            return None
        
        # Calculate metrics
        metrics = calculate_metrics(y_test, ensemble_pred)
        training_time = time.time() - start_time
        
        # Prepare results
        results = {
            'Segment': segment_name,
            'Branch': branch,
            'Tonnage': tonnage,
            'Star_Rating': star_rating,
            'Train_Samples': len(train_df),
            'Test_Samples': len(test_df),
            'MAE': metrics['MAE'],
            'RMSE': metrics['RMSE'],
            'MAPE': metrics['MAPE'],
            'Training_Time': training_time,
            'Ridge_Alpha': best_alpha,
            'Base_Models_Count': len(base_predictions)
        }
        
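        # Add model file paths, matched by filename suffix rather than list
        # position so that a failed base model cannot shift the labels
        model_files = base_model_files + [ridge_file]
        model_suffixes = {
            'Prophet_Weather': '_prophet_weather.pkl',
            'HoltWinters': '_holtwinters.pkl',
            'Prophet_Univariate': '_prophet_univariate.pkl',
            'Ridge_Meta': '_ridge_meta.pkl'
        }
        
        for model_type, suffix in model_suffixes.items():
            results[f'{model_type}_Model_File'] = next(
                (f for f in model_files if f.endswith(suffix)), None)
        
        # Save detailed predictions
        predictions_df = pd.DataFrame({
            'Date': test_df['YearMonth'],
            'Actual': y_test.values,
            'Ensemble_Prediction': ensemble_pred
        })
        
        # Add individual base model predictions, labelled via their model file;
        # np.asarray avoids index misalignment when a prediction is a Series
        for pred, model_file in zip(base_predictions, base_model_files):
            model_name = next(
                (name for name, suffix in model_suffixes.items() if model_file.endswith(suffix)),
                'Base_Model')
            predictions_df[f'{model_name}_Prediction'] = np.asarray(pred)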
        
        predictions_file = f'outputs/multi_segment_predictions_{segment_name}.csv'
        os.makedirs(os.path.dirname(predictions_file), exist_ok=True)
        predictions_df.to_csv(predictions_file, index=False)
        
        print(f"\n  Results for {segment_name}:")
        print(f"    MAE: {metrics['MAE']:.2f}")
        print(f"    RMSE: {metrics['RMSE']:.2f}")
        print(f"    MAPE: {metrics['MAPE']:.2f}%")
        print(f"    Training time: {training_time:.1f} seconds")
        print(f"    Models saved: {len(model_files)} files")
        print(f"    Predictions saved: {predictions_file}")
        
        return results
        
    except Exception as e:
        print(f"  Error training ensemble for {segment_name}: {e}")
        return None

print("Complete ensemble training function defined successfully!")


Complete ensemble training function defined successfully!


In [10]:
# Main training loop
print("Starting multi-segment ensemble training...")
print(f"Target segments: {len(target_segments)}")
print(f"Training period: 2022-2024")
print(f"Test period: 2025")
print(f"\n{'='*80}")

# Initialize results storage
all_results = []
successful_segments = 0
failed_segments = 0

# Process each segment
for i, segment in enumerate(target_segments, 1):
    print(f"\n[{i}/{len(target_segments)}] Processing segment...")
    
    try:
        # Train ensemble for this segment
        segment_results = train_segment_ensemble(segment, df_final)
        
        if segment_results is not None:
            all_results.append(segment_results)
            successful_segments += 1
            
            # Print summary
            print(f"\n  ✅ SUCCESS: {segment_results['Segment']}")
            print(f"     MAE: {segment_results['MAE']:.2f}")
            print(f"     RMSE: {segment_results['RMSE']:.2f}")
            print(f"     MAPE: {segment_results['MAPE']:.2f}%")
            print(f"     Training time: {segment_results['Training_Time']:.1f}s")
        else:
            failed_segments += 1
            print(f"\n  ❌ FAILED: {segment['branch']} - {segment['tonnage']}T - {segment['star_rating']}")
            
    except Exception as e:
        failed_segments += 1
        print(f"\n  ❌ ERROR: {segment['branch']} - {segment['tonnage']}T - {segment['star_rating']}")
        print(f"     Error: {e}")

print(f"\n{'='*80}")
print("TRAINING COMPLETED")
print(f"{'='*80}")
print(f"Total segments processed: {len(target_segments)}")
print(f"Successful: {successful_segments}")
print(f"Failed: {failed_segments}")
print(f"Success rate: {successful_segments/len(target_segments)*100:.1f}%")

if successful_segments > 0:
    print("\nResults will be saved to outputs/multi_segment_results.csv")
else:
    print("\nNo successful training results to save.")


Starting multi-segment ensemble training...
Target segments: 10
Training period: 2022-2024
Test period: 2025
\n[1/10] Processing segment...
Training ensemble for: CHENNAI - 1.5T - 3 Star
Segment name: chennai_1_5_3star
  Preparing data for CHENNAI - 1.5T - 3 Star
    Train samples: 36 (2022-2024)
    Test samples: 6 (2025)
  Data saved: outputs/data/chennai_1_5_3star_train.csv, outputs/data/chennai_1_5_3star_test.csv
  Training base models for chennai_1_5_3star...
    Training Prophet (Weather) model...
      Error training Prophet (Weather): day is out of range for month: 0, at position 0
    Training Holt-Winters model...
      Holt-Winters model saved: outputs/models/multi_segment/chennai_1_5_3star_holtwinters.pkl
    Training Prophet (Univariate) model...
      Error training Prophet (Univariate): day is out of range for month: 0, at position 0
  Base models training completed. 1 models trained successfully.
  Training Ridge meta-model for chennai_1_5_3star...
  Skipping chennai_1_

In [11]:
# Save results and create summary
if len(all_results) > 0:
    print("\nSaving results and creating summary...")
    
    # Convert results to DataFrame
    results_df = pd.DataFrame(all_results)
    
    # Save main results
    results_file = 'outputs/multi_segment_results.csv'
    os.makedirs(os.path.dirname(results_file), exist_ok=True)
    results_df.to_csv(results_file, index=False)
    print(f"Results saved to: {results_file}")
    
    # Create summary statistics
    print(f"\n{'='*60}")
    print("SUMMARY STATISTICS")
    print(f"{'='*60}")
    
    print("\nPerformance Metrics:")
    print(f"  Average MAE: {results_df['MAE'].mean():.2f}")
    print(f"  Average RMSE: {results_df['RMSE'].mean():.2f}")
    print(f"  Average MAPE: {results_df['MAPE'].mean():.2f}%")
    print(f"  Best MAE: {results_df['MAE'].min():.2f}")
    print(f"  Best RMSE: {results_df['RMSE'].min():.2f}")
    print(f"  Best MAPE: {results_df['MAPE'].min():.2f}%")
    
    print("\nTraining Statistics:")
    print(f"  Average training time: {results_df['Training_Time'].mean():.1f} seconds")
    print(f"  Total training time: {results_df['Training_Time'].sum():.1f} seconds")
    print(f"  Average base models: {results_df['Base_Models_Count'].mean():.1f}")
    
    print("\nBest Performing Segments:")
    best_mae = results_df.loc[results_df['MAE'].idxmin()]
    best_rmse = results_df.loc[results_df['RMSE'].idxmin()]
    best_mape = results_df.loc[results_df['MAPE'].idxmin()]
    
    print(f"  Best MAE: {best_mae['Segment']} (MAE: {best_mae['MAE']:.2f})")
    print(f"  Best RMSE: {best_rmse['Segment']} (RMSE: {best_rmse['RMSE']:.2f})")
    print(f"  Best MAPE: {best_mape['Segment']} (MAPE: {best_mape['MAPE']:.2f}%)")
    
    print("\nModel Files Created:")
    model_files_created = 0
    for col in results_df.columns:
        if col.endswith('_Model_File'):
            non_null_files = results_df[col].notna().sum()
            model_files_created += non_null_files
            print(f"  {col}: {non_null_files} files")
    
    print(f"  Total model files: {model_files_created}")
    
    # Display results table
    print("\nDetailed Results:")
    display_cols = ['Segment', 'Branch', 'Tonnage', 'Star_Rating', 'MAE', 'RMSE', 'MAPE', 'Training_Time']
    print(results_df[display_cols].round(2).to_string(index=False))
    
else:
    print("\nNo results to save - all training attempts failed.")


\nNo results to save - all training attempts failed.


In [12]:
# Create visualizations
if len(all_results) > 0:
    print("\nCreating visualizations...")
    
    # Set up the plotting style
    plt.style.use('seaborn-v0_8')
    sns.set_palette("husl")
    
    # Create comprehensive visualization
    fig, axes = plt.subplots(2, 3, figsize=(20, 12))
    fig.suptitle('Multi-Segment Ensemble Model Performance Analysis', fontsize=16, fontweight='bold')
    
    # 1. MAE comparison
    ax1 = axes[0, 0]
    mae_data = results_df.sort_values('MAE')
    bars1 = ax1.barh(range(len(mae_data)), mae_data['MAE'], color='skyblue', alpha=0.7)
    ax1.set_yticks(range(len(mae_data)))
    ax1.set_yticklabels(mae_data['Segment'], fontsize=8)
    ax1.set_xlabel('MAE')
    ax1.set_title('Mean Absolute Error (MAE) by Segment')
    ax1.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars1):
        width = bar.get_width()
        ax1.text(width + width*0.01, bar.get_y() + bar.get_height()/2, 
                f'{width:.1f}', ha='left', va='center', fontsize=8)
    
    # 2. RMSE comparison
    ax2 = axes[0, 1]
    rmse_data = results_df.sort_values('RMSE')
    bars2 = ax2.barh(range(len(rmse_data)), rmse_data['RMSE'], color='lightcoral', alpha=0.7)
    ax2.set_yticks(range(len(rmse_data)))
    ax2.set_yticklabels(rmse_data['Segment'], fontsize=8)
    ax2.set_xlabel('RMSE')
    ax2.set_title('Root Mean Squared Error (RMSE) by Segment')
    ax2.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars2):
        width = bar.get_width()
        ax2.text(width + width*0.01, bar.get_y() + bar.get_height()/2, 
                f'{width:.1f}', ha='left', va='center', fontsize=8)
    
    # 3. MAPE comparison
    ax3 = axes[0, 2]
    mape_data = results_df.sort_values('MAPE')
    bars3 = ax3.barh(range(len(mape_data)), mape_data['MAPE'], color='lightgreen', alpha=0.7)
    ax3.set_yticks(range(len(mape_data)))
    ax3.set_yticklabels(mape_data['Segment'], fontsize=8)
    ax3.set_xlabel('MAPE (%)')
    ax3.set_title('Mean Absolute Percentage Error (MAPE) by Segment')
    ax3.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars3):
        width = bar.get_width()
        ax3.text(width + width*0.01, bar.get_y() + bar.get_height()/2, 
                f'{width:.1f}%', ha='left', va='center', fontsize=8)
    
    # 4. Training time comparison
    ax4 = axes[1, 0]
    time_data = results_df.sort_values('Training_Time')
    bars4 = ax4.barh(range(len(time_data)), time_data['Training_Time'], color='gold', alpha=0.7)
    ax4.set_yticks(range(len(time_data)))
    ax4.set_yticklabels(time_data['Segment'], fontsize=8)
    ax4.set_xlabel('Training Time (seconds)')
    ax4.set_title('Training Time by Segment')
    ax4.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars4):
        width = bar.get_width()
        ax4.text(width + width*0.01, bar.get_y() + bar.get_height()/2, 
                f'{width:.1f}s', ha='left', va='center', fontsize=8)
    
    # 5. Performance by branch
    ax5 = axes[1, 1]
    branch_performance = results_df.groupby('Branch')['MAE'].mean().sort_values()
    bars5 = ax5.bar(range(len(branch_performance)), branch_performance.values, 
                   color='purple', alpha=0.7)
    ax5.set_xticks(range(len(branch_performance)))
    ax5.set_xticklabels(branch_performance.index, rotation=45, ha='right')
    ax5.set_ylabel('Average MAE')
    ax5.set_title('Average Performance by Branch')
    ax5.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars5):
        height = bar.get_height()
        ax5.text(bar.get_x() + bar.get_width()/2, height + height*0.01, 
                f'{height:.1f}', ha='center', va='bottom', fontsize=10)
    
    # 6. Performance by tonnage
    ax6 = axes[1, 2]
    tonnage_performance = results_df.groupby('Tonnage')['MAE'].mean().sort_values()
    bars6 = ax6.bar(range(len(tonnage_performance)), tonnage_performance.values, 
                   color='orange', alpha=0.7)
    ax6.set_xticks(range(len(tonnage_performance)))
    ax6.set_xticklabels([f'{t}T' for t in tonnage_performance.index])
    ax6.set_ylabel('Average MAE')
    ax6.set_title('Average Performance by Tonnage')
    ax6.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for i, bar in enumerate(bars6):
        height = bar.get_height()
        ax6.text(bar.get_x() + bar.get_width()/2, height + height*0.01, 
                f'{height:.1f}', ha='center', va='bottom', fontsize=10)
    
    plt.tight_layout()
    
    # Save the plot before plt.show(); saving afterwards can write a blank image
    plot_file = 'outputs/multi_segment_performance_analysis.png'
    fig.savefig(plot_file, dpi=300, bbox_inches='tight')
    print(f"Performance analysis plot saved to: {plot_file}")
    plt.show()
    
    # Create model inventory
    print(f"\n{'='*60}")
    print("MODEL INVENTORY")
    print(f"{'='*60}")
    
    model_inventory = []
    for _, row in results_df.iterrows():
        segment = row['Segment']
        for col in results_df.columns:
            if col.endswith('_Model_File') and pd.notna(row[col]):
                model_type = col.replace('_Model_File', '')
                model_inventory.append({
                    'Segment': segment,
                    'Model_Type': model_type,
                    'File_Path': row[col],
                    # pd.notna(row[col]) is already guaranteed by the enclosing if
                    'File_Exists': os.path.exists(row[col])
                })
    
    if model_inventory:
        inventory_df = pd.DataFrame(model_inventory)
        inventory_file = 'outputs/multi_segment_model_inventory.csv'
        inventory_df.to_csv(inventory_file, index=False)
        print(f"Model inventory saved to: {inventory_file}")
        
        # Check file existence
        existing_files = inventory_df['File_Exists'].sum()
        total_files = len(inventory_df)
        print(f"Model files status: {existing_files}/{total_files} files exist")
        
        if existing_files < total_files:
            missing_files = inventory_df[~inventory_df['File_Exists']]
            print("Missing files:")
            for _, row in missing_files.iterrows():
                print(f"  - {row['Segment']} - {row['Model_Type']}: {row['File_Path']}")
    
    print(f"\n{'='*60}")
    print("ANALYSIS COMPLETE")
    print(f"{'='*60}")
    print(f"✅ Results saved to: {results_file}")
    print(f"✅ Predictions saved to: outputs/multi_segment_predictions_*.csv")
    print(f"✅ Models saved to: outputs/models/multi_segment/*.pkl")
    print(f"✅ Visualizations saved to: {plot_file}")
    # inventory_file is only defined when at least one model file was recorded
    if model_inventory:
        print(f"✅ Model inventory saved to: {inventory_file}")
    
else:
    print("\nNo results available for visualization.")


\nNo results available for visualization.


## Final Recommendations and Model Usage

### Key Findings
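The run recorded in this notebook did not produce ensemble results to analyze:

1. **Best Performing Segments**: No MAE, RMSE, or MAPE values were recorded, because the summary and visualization cells report that all training attempts failed. After a successful run, the summary cell lists the segments with the lowest value of each metric.
2. **Model Performance**: For `chennai_1_5_3star`, both Prophet models (Weather and Univariate) failed with `day is out of range for month: 0, at position 0`. Only the Holt-Winters model trained; it was saved under `outputs/models/multi_segment/` and can be loaded for future predictions.
3. **Training Efficiency**: With one of three base models trained, the segment was skipped at the Ridge meta-model step, so training times and success rates are not available for comparison.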
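
The `day is out of range for month` message in the training log reads like a date-parsing failure, although the log does not show which column or value triggered it. A minimal sketch of a pre-training check is shown below; the column name `date`, the `%Y-%m-%d` format, and the helper `find_invalid_dates` are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd

def find_invalid_dates(df, date_col="date", fmt="%Y-%m-%d"):
    """Return the rows whose date value cannot be parsed with the given format."""
    # errors="coerce" turns unparseable values into NaT instead of raising
    parsed = pd.to_datetime(df[date_col], format=fmt, errors="coerce")
    return df.loc[parsed.isna(), [date_col]]

# Synthetic sample (not pipeline data): "0" and "2024-02-30" are not valid dates
sample = pd.DataFrame({
    "date": ["2024-01-15", "0", "2024-02-30"],
    "units": [12, 7, 9],
})
bad_rows = find_invalid_dates(sample)
print(bad_rows)
```

Running a check like this on each segment's training frame before fitting Prophet would show whether malformed dates are present, and which rows need to be repaired or dropped.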

### Model Loading and Usage Examples

```python
# Example: Load models for a specific segment
segment_name = "chennai_1_5_3star"  # Replace with actual segment name
models = load_saved_models(segment_name)

# Example: Make predictions using loaded models
# (This would require implementing prediction functions for each model type)
```
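
The implementation of `load_saved_models` is not shown in this section. As a rough illustration of what such a loader might do, the sketch below uses a hypothetical function, `load_segment_models`, that follows the file naming listed under "File Structure Created" and tolerates missing files, which matters when some base models fail to train.

```python
import os
import pickle

MODEL_DIR = "outputs/models/multi_segment"
# Suffixes taken from the file structure documented in this notebook
MODEL_SUFFIXES = ("prophet_weather", "holtwinters", "prophet_univariate", "ridge_meta")

def load_segment_models(segment, model_dir=MODEL_DIR):
    """Load every pickled model that exists for a segment.

    Returns (models, missing): a dict keyed by model suffix, and a list of
    expected paths that were not found on disk.
    """
    models, missing = {}, []
    for suffix in MODEL_SUFFIXES:
        path = os.path.join(model_dir, f"{segment}_{suffix}.pkl")
        if os.path.exists(path):
            # Only unpickle files you created yourself; pickle can run arbitrary code
            with open(path, "rb") as fh:
                models[suffix] = pickle.load(fh)
        else:
            missing.append(path)
    return models, missing

models, missing = load_segment_models("chennai_1_5_3star")
print(f"Loaded: {sorted(models)}; missing files: {len(missing)}")
```

Returning the missing paths alongside the loaded models lets the caller decide whether a partial set of models is enough to proceed.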

### Deployment Recommendations

1. **Production Use**: For each segment, deploy the model with the lowest MAE on the 2025 test data
2. **Regular Retraining**: Retrain models every 3-6 months with new data
3. **Monitoring**: Track MAE, RMSE, and MAPE on new data and retrain if accuracy degrades
4. **Model Selection**: Prefer the Ridge meta-model ensemble where it outperforms the individual base models on the test data
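
The Model Selection recommendation relies on a Ridge meta-model that learns how to weight the base-model forecasts. The sketch below illustrates the idea with a closed-form ridge solution in NumPy, used here as a stand-in for the `sklearn.linear_model.Ridge` class the notebook imports. The data is synthetic, and the function names `fit_ridge_meta` and `predict_ridge_meta` are hypothetical.

```python
import numpy as np

def fit_ridge_meta(base_preds, y_true, alpha=1.0):
    """Fit ridge weights that combine base-model forecasts (closed form)."""
    X = np.asarray(base_preds, dtype=float)   # shape: (n_samples, n_base_models)
    y = np.asarray(y_true, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc; the intercept is not penalized
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept

def predict_ridge_meta(base_preds, weights, intercept):
    """Combine base-model forecasts using the fitted weights."""
    return np.asarray(base_preds, dtype=float) @ weights + intercept

# Synthetic illustration: two noisy "base model" forecasts of the same series
rng = np.random.default_rng(42)
actual = 100 + 10 * np.sin(np.arange(36) / 6 * np.pi)
base = np.column_stack([
    actual + rng.normal(0, 5, 36),
    actual + rng.normal(0, 8, 36),
])

# Fit on the first 24 points, then combine forecasts for the remaining 12
weights, intercept = fit_ridge_meta(base[:24], actual[:24], alpha=1.0)
ensemble = predict_ridge_meta(base[24:], weights, intercept)
print("weights:", weights.round(3), "intercept:", round(float(intercept), 3))
```

A meta-model of this kind needs forecasts from more than one base model to be useful, which is consistent with the segment being skipped when only Holt-Winters trained.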

### File Structure Created

```
outputs/
├── multi_segment_results.csv          # Main results summary
├── multi_segment_model_inventory.csv  # Model file inventory
├── multi_segment_performance_analysis.png  # Performance visualizations
├── data/                              # Segment-specific train/test data
│   ├── {segment}_train.csv
│   └── {segment}_test.csv
├── models/multi_segment/              # All trained models
│   ├── {segment}_prophet_weather.pkl
│   ├── {segment}_holtwinters.pkl
│   ├── {segment}_prophet_univariate.pkl
│   └── {segment}_ridge_meta.pkl
└── multi_segment_predictions_{segment}.csv  # Detailed predictions
```
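
The layout above can be turned into a quick completeness check. The sketch below builds a segment name and lists the per-segment files the tree implies. The naming rule in `segment_key` is inferred from the single example `chennai_1_5_3star` and should be treated as an assumption; both helper functions are hypothetical.

```python
from pathlib import Path

MODEL_SUFFIXES = ("prophet_weather", "holtwinters", "prophet_univariate", "ridge_meta")

def segment_key(branch, tonnage, star_rating):
    """Build a segment name such as 'chennai_1_5_3star' (assumed convention)."""
    tonnage_part = f"{float(tonnage):.1f}".replace(".", "_")
    return f"{branch.strip().lower()}_{tonnage_part}_{int(star_rating)}star"

def expected_outputs(segment, root="outputs"):
    """List the per-segment files implied by the documented directory tree."""
    root = Path(root)
    paths = [
        root / "data" / f"{segment}_train.csv",
        root / "data" / f"{segment}_test.csv",
    ]
    paths += [root / "models" / "multi_segment" / f"{segment}_{suffix}.pkl"
              for suffix in MODEL_SUFFIXES]
    paths.append(root / f"multi_segment_predictions_{segment}.csv")
    return paths

segment = segment_key("CHENNAI", 1.5, 3)
missing = [p for p in expected_outputs(segment) if not p.exists()]
print(f"{segment}: {len(missing)} expected files missing")
```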

### Next Steps

1. Resolve the Prophet `day is out of range for month` error and rerun training so that results are produced
2. Review the results and identify the best performing segments
3. Deploy the top-performing models for production use
4. Set up monitoring for model performance
5. Plan a regular retraining schedule
6. Consider expanding to additional segment combinations if needed
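
The monitoring and retraining steps can be sketched as a simple threshold check: compare the MAPE on recent data with the MAPE recorded at training time, and flag the segment when it has worsened beyond a tolerance. The function names, the 1.25 tolerance, and the sample numbers below are illustrative assumptions, not values from this pipeline.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

def needs_retraining(actual, predicted, baseline_mape, tolerance=1.25):
    """Flag a segment when recent MAPE exceeds baseline_mape * tolerance."""
    current = mape(actual, predicted)
    return current > baseline_mape * tolerance, current

# Made-up recent actuals and forecasts for one segment
flag, current = needs_retraining(
    actual=[100, 120, 80],
    predicted=[90, 130, 100],
    baseline_mape=10.0,
)
print(f"Current MAPE: {current:.2f}% -> retrain: {flag}")
```

A check like this can run on a schedule for each segment, so retraining happens when accuracy degrades rather than only at fixed intervals.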
