{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "model-training"
   },
   "source": [
    "# 🤖 Advanced Model Training - Multi-Framework ML\n",
    "\n",
    "Comprehensive model training with TensorFlow, PyTorch, Scikit-learn, and GPU acceleration using the enhanced ML module.\n",
    "\n",
    "## Features:\n",
    "- Multi-framework model training (TensorFlow, PyTorch, Scikit-learn)\n",
    "- GPU acceleration with NVIDIA RAPIDS\n",
    "- Advanced feature engineering\n",
    "- Hyperparameter optimization\n",
    "- Model explainability integration\n",
    "- Performance comparison and visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "install-packages"
   },
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "!pip install -q pandas numpy scikit-learn plotly matplotlib seaborn\n",
    "!pip install -q xgboost lightgbm catboost\n",
    "!pip install -q shap lime\n",
    "!pip install -q tensorflow torch\n",
    "\n",
    "# Try installing GPU packages\n",
    "try:\n",
    "    !pip install -q cuml-cu11 cudf-cu11 --extra-index-url=https://pypi.nvidia.com\n",
    "    print(\"✅ GPU packages installed successfully\")\n",
    "except:\n",
    "    print(\"⚠️ GPU packages not available, using CPU\")\n",
    "\n",
    "print(\"✅ All packages installed successfully!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "import-libraries"
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as go\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import time\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# ML libraries\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
    "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier\n",
    "from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve\n",
    "from sklearn.preprocessing import StandardScaler, LabelEncoder\n",
    "import xgboost as xgb\n",
    "import lightgbm as lgb\n",
    "\n",
    "# Deep Learning libraries\n",
    "try:\n",
    "    import tensorflow as tf\n",
    "    print(f\"✅ TensorFlow {tf.__version__} available\")\n",
    "except ImportError:\n",
    "    print(\"⚠️ TensorFlow not available\")\n",
    "\n",
    "try:\n",
    "    import torch\n",
    "    import torch.nn as nn\n",
    "    print(f\"✅ PyTorch {torch.__version__} available\")\n",
    "except ImportError:\n",
    "    print(\"⚠️ PyTorch not available\")\n",
    "\n",
    "# Explainability\n",
    "try:\n",
    "    import shap\n",
    "    print(\"✅ SHAP available for explainability\")\n",
    "except ImportError:\n",
    "    print(\"⚠️ SHAP not available\")\n",
    "\n",
    "print(\"✅ All libraries imported successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "enhanced-ml-module"
   },
   "source": [
    "## 🚀 Enhanced ML Module Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ml-module-implementation"
   },
   "outputs": [],
   "source": [
    "class EnhancedMLModule:\n",
    "    \"\"\"Enhanced ML Module with Multi-Framework Support\"\"\"\n",
    "    \n",
    "    def __init__(self):\n",
    "        self.models = {}\n",
    "        self.scalers = {}\n",
    "        self.encoders = {}\n",
    "        self.feature_names = []\n",
    "        self.target_name = \"\"\n",
    "        self.task_type = \"\"\n",
    "        self.use_gpu = self._detect_gpu()\n",
    "        \n",
    "    def _detect_gpu(self):\n",
    "        \"\"\"Detect GPU availability\"\"\"\n",
    "        try:\n",
    "            import torch\n",
    "            return torch.cuda.is_available()\n",
    "        except:\n",
    "            return False\n",
    "    \n",
    "    def auto_ml_pipeline(self, X, y, task_type='auto', test_size=0.2, cv_folds=5, \n",
    "                        random_state=42, include_advanced_models=True):\n",
    "        \"\"\"Enhanced AutoML pipeline with multiple frameworks\"\"\"\n",
    "        \n",
    "        # Detect task type\n",
    "        if task_type == 'auto':\n",
    "            self.task_type = self._detect_task_type(y)\n",
    "        else:\n",
    "            self.task_type = task_type\n",
    "            \n",
    "        self.feature_names = X.columns.tolist()\n",
    "        self.target_name = y.name if hasattr(y, 'name') else 'target'\n",
    "        \n",
    "        print(f\"🎯 Task Type: {self.task_type}\")\n",
    "        print(f\"📊 Features: {len(self.feature_names)}\")\n",
    "        print(f\"🔢 Samples: {len(X)}\")\n",
    "        \n",
    "        # Preprocess data\n",
    "        X_processed, y_processed = self._preprocess_data(X, y, include_advanced_models)\n",
    "        \n",
    "        # Split data\n",
    "        X_train, X_test, y_train, y_test = train_test_split(\n",
    "            X_processed, y_processed, test_size=test_size, random_state=random_state,\n",
    "            stratify=y_processed if self.task_type == 'classification' and len(np.unique(y_processed)) > 1 else None\n",
    "        )\n",
    "        \n",
    "        print(f\"📈 Training set: {X_train.shape}\")\n",
    "        print(f\"📊 Test set: {X_test.shape}\")\n",
    "        \n",
    "        # Train models\n",
    "        model_results = self._train_multiple_models(X_train, y_train, X_test, y_test, cv_folds)\n",
    "        \n",
    "        # Select best model\n",
    "        best_model_name = self._select_best_model(model_results)\n",
    "        \n",
    "        # Feature importance\n",
    "        feature_importance = self._get_feature_importance(best_model_name, X_train.columns)\n",
    "        \n",
    "        # Generate predictions\n",
    "        best_model = self.models[best_model_name]\n",
    "        train_predictions = best_model.predict(X_train)\n",
    "        test_predictions = best_model.predict(X_test)\n",
    "        \n",
    "        # Calculate metrics\n",
    "        train_metrics = self._calculate_metrics(y_train, train_predictions)\n",
    "        test_metrics = self._calculate_metrics(y_test, test_predictions)\n",
    "        \n",
    "        # Compile results\n",
    "        results = {\n",
    "            'task_type': self.task_type,\n",
    "            'best_model': best_model_name,\n",
    "            'best_model_object': best_model,\n",
    "            'all_models': model_results,\n",
    "            'feature_importance': feature_importance,\n",
    "            'train_metrics': train_metrics,\n",
    "            'test_metrics': test_metrics,\n",
    "            'data_info': {\n",
    "                'n_features': len(self.feature_names),\n",
    "                'n_samples_train': len(X_train),\n",
    "                'n_samples_test': len(X_test),\n",
    "                'feature_names': self.feature_names,\n",
    "                'target_name': self.target_name\n",
    "            },\n",
    "            'predictions': {\n",
    "                'train_predictions': train_predictions,\n",
    "                'test_predictions': test_predictions,\n",
    "                'train_actual': y_train,\n",
    "                'test_actual': y_test\n",
    "            }\n",
    "        }\n",
    "        \n",
    "        return results\n",
    "    \n",
    "    def _detect_task_type(self, y):\n",
    "        \"\"\"Automatically detect task type\"\"\"\n",
    "        if y.dtype == 'object' or y.dtype == 'category':\n",
    "            return 'classification'\n",
    "        elif y.nunique() <= 20 and y.dtype in ['int64', 'int32']:\n",
    "            return 'classification'\n",
    "        else:\n",
    "            return 'regression'\n",
    "    \n",
    "    def _preprocess_data(self, X, y, include_advanced_models=True):\n",
    "        \"\"\"Preprocess data with advanced feature engineering\"\"\"\n",
    "        X_processed = X.copy()\n",
    "        y_processed = y.copy()\n",
    "        \n",
    "        # Handle missing values\n",
    "        for col in X_processed.columns:\n",
    "            if X_processed[col].isnull().any():\n",
    "                if X_processed[col].dtype in ['int64', 'float64']:\n",
    "                    X_processed[col].fillna(X_processed[col].median(), inplace=True)\n",
    "                else:\n",
    "                    X_processed[col].fillna(X_processed[col].mode()[0] if len(X_processed[col].mode()) > 0 else 'Unknown', inplace=True)\n",
    "        \n",
    "        # Encode categorical variables\n",
    "        categorical_cols = X_processed.select_dtypes(include=['object']).columns\n",
    "        for col in categorical_cols:\n",
    "            if col not in self.encoders:\n",
    "                self.encoders[col] = LabelEncoder()\n",
    "                X_processed[col] = self.encoders[col].fit_transform(X_processed[col].astype(str))\n",
    "        \n",
    "        # Advanced feature engineering\n",
    "        if include_advanced_models:\n",
    "            X_processed = self._add_advanced_features(X_processed)\n",
    "        \n",
    "        # Scale features\n",
    "        if 'feature_scaler' not in self.scalers:\n",
    "            self.scalers['feature_scaler'] = StandardScaler()\n",
    "            X_scaled = self.scalers['feature_scaler'].fit_transform(X_processed)\n",
    "            X_processed = pd.DataFrame(X_scaled, columns=X_processed.columns, index=X_processed.index)\n",
    "        \n",
    "        return X_processed, y_processed\n",
    "    \n",
    "    def _add_advanced_features(self, X):\n",
    "        \"\"\"Add advanced features\"\"\"\n",
    "        X_enhanced = X.copy()\n",
    "        \n",
    "        # Add polynomial features for numeric columns\n",
    "        numeric_cols = X_enhanced.select_dtypes(include=[np.number]).columns\n",
    "        if len(numeric_cols) >= 2:\n",
    "            # Add interaction terms\n",
    "            for i, col1 in enumerate(numeric_cols[:3]):\n",
    "                for j, col2 in enumerate(numeric_cols[i+1:min(i+4, len(numeric_cols))]):\n",
    "                    X_enhanced[f'{col1}_x_{col2}'] = X_enhanced[col1] * X_enhanced[col2]\n",
    "        \n",
    "        # Add statistical features\n",
    "        if len(numeric_cols) > 0:\n",
    "            X_enhanced['feature_mean'] = X_enhanced[numeric_cols].mean(axis=1)\n",
    "            X_enhanced['feature_std'] = X_enhanced[numeric_cols].std(axis=1)\n",
    "        \n",
    "        return X_enhanced\n",
    "    \n",
    "    def _train_multiple_models(self, X_train, y_train, X_test, y_test, cv_folds):\n",
    "        \"\"\"Train multiple ML models\"\"\"\n",
    "        \n",
    "        # Get model algorithms based on task type\n",
    "        algorithms = self._get_model_algorithms()\n",
    "        results = {}\n",
    "        \n",
    "        for name, model in algorithms.items():\n",
    "            try:\n",
    "                print(f\"🔄 Training {name}...\")\n",
    "                start_time = time.time()\n",
    "                \n",
    "                # Train model\n",
    "                model.fit(X_train, y_train)\n",
    "                training_time = time.time() - start_time\n",
    "                \n",
    "                # Cross-validation\n",
    "                cv_scores = cross_val_score(model, X_train, y_train, cv=min(cv_folds, 5), \n",
    "                                          scoring='accuracy' if self.task_type == 'classification' else 'r2')\n",
    "                \n",
    "                # Predictions\n",
    "                train_pred = model.predict(X_train)\n",
    "                test_pred = model.predict(X_test)\n",
    "                \n",
    "                # Calculate metrics\n",
    "                train_metrics = self._calculate_metrics(y_train, train_pred)\n",
    "                test_metrics = self._calculate_metrics(y_test, test_pred)\n",
    "                \n",
    "                # Store results\n",
    "                results[name] = {\n",
    "                    'model': model,\n",
    "                    'training_time': training_time,\n",
    "                    'cv_scores': cv_scores,\n",
    "                    'cv_mean': cv_scores.mean(),\n",
    "                    'cv_std': cv_scores.std(),\n",
    "                    'train_metrics': train_metrics,\n",
    "                    'test_metrics': test_metrics,\n",
    "                    'primary_score': test_metrics.get('accuracy', test_metrics.get('r2', 0)),\n",
    "                    'gpu_accelerated': self._is_gpu_accelerated(model)\n",
    "                }\n",
    "                \n",
    "                self.models[name] = model\n",
    "                \n",
    "                print(f\"✅ {name} - Score: {results[name]['primary_score']:.4f}, Time: {training_time:.2f}s\")\n",
    "                \n",
    "            except Exception as e:\n",
    "                print(f\"❌ {name} failed: {e}\")\n",
    "                continue\n",
    "        \n",
    "        return results\n",
    "    \n",
    "    def _get_model_algorithms(self):\n",
    "        \"\"\"Get model algorithms based on task type\"\"\"\n",
    "        \n",
    "        if self.task_type == 'classification':\n",
    "            return {\n",
    "                'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),\n",
    "                'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000),\n",
    "                'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),\n",
    "                'LightGBM': lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),\n",
    "                'DecisionTree': DecisionTreeClassifier(random_state=42, max_depth=10),\n",
    "                'KNN': KNeighborsClassifier(n_neighbors=5),\n",
    "                'SVM': SVC(random_state=42, probability=True),\n",
    "                'GradientBoosting': GradientBoostingClassifier(n_estimators=100, random_state=42)\n",
    "            }\n",
    "        else:\n",
    "            return {\n",
    "                'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),\n",
    "                'LinearRegression': LinearRegression(),\n",
    "                'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42),\n",
    "                'LightGBM': lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1),\n",
    "                'Ridge': Ridge(random_state=42)\n",
    "            }\n",
    "    \n",
    "    def _calculate_metrics(self, y_true, y_pred):\n",
    "        \"\"\"Calculate evaluation metrics\"\"\"\n",
    "        \n",
    "        if self.task_type == 'classification':\n",
    "            accuracy = accuracy_score(y_true, y_pred)\n",
    "            \n",
    "            # For binary classification\n",
    "            if len(np.unique(y_true)) == 2:\n",
    "                try:\n",
    "                    auc = roc_auc_score(y_true, y_pred)\n",
    "                    return {'accuracy': accuracy, 'roc_auc': auc}\n",
    "                except:\n",
    "                    return {'accuracy': accuracy}\n",
    "            else:\n",
    "                return {'accuracy': accuracy}\n",
    "        else:\n",
    "            from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error\n",
    "            return {\n",
    "                'r2': r2_score(y_true, y_pred),\n",
    "                'mse': mean_squared_error(y_true, y_pred),\n",
    "                'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),\n",
    "                'mae': mean_absolute_error(y_true, y_pred)\n",
    "            }\n",
    "    \n",
    "    def _select_best_model(self, results):\n",
    "        \"\"\"Select the best performing model\"\"\"\n",
    "        if not results:\n",
    "            return None\n",
    "        \n",
    "        return max(results.keys(), key=lambda x: results[x]['primary_score'])\n",
    "    \n",
    "    def _get_feature_importance(self, model_name, feature_names):\n",
    "        \"\"\"Get feature importance from model\"\"\"\n",
    "        \n",
    "        if model_name not in self.models:\n",
    "            return {}\n",
    "        \n",
    "        model = self.models[model_name]\n",
    "        importance_dict = {}\n",
    "        \n",
    "        try:\n",
    "            if hasattr(model, 'feature_importances_'):\n",
    "                # Tree-based models\n",
    "                importances = model.feature_importances_\n",
    "                importance_dict = dict(zip(feature_names, importances))\n",
    "            elif hasattr(model, 'coef_'):\n",
    "                # Linear models\n",
    "                if len(model.coef_.shape) == 1:\n",
    "                    importances = np.abs(model.coef_)\n",
    "                else:\n",
    "                    importances = np.abs(model.coef_).mean(axis=0)\n",
    "                importance_dict = dict(zip(feature_names, importances))\n",
    "            \n",
    "            # Sort by importance\n",
    "            importance_dict = dict(sorted(importance_dict.items(), key=lambda x: x[1], reverse=True))\n",
    "            \n",
    "        except Exception as e:\n",
    "            print(f\"⚠️ Feature importance failed for {model_name}: {e}\")\n",
    "        \n",
    "        return importance_dict\n",
    "    \n",
    "    def _is_gpu_accelerated(self, model):\n",
    "        \"\"\"Check if model uses GPU acceleration\"\"\"\n",
    "        model_name = type(model).__name__.lower()\n",
    "        \n",
    "        # Check for GPU-enabled indicators\n",
    "        gpu_indicators = ['xgb', 'lightgbm']\n",
    "        \n",
    "        for indicator in gpu_indicators:\n",
    "            if indicator in model_name:\n",
    "                return True\n",
    "        \n",
    "        return False\n",
    "    \n",
    "    def _get_model_framework(self, model_name):\n",
    "        \"\"\"Get model framework\"\"\"\n",
    "        model_name_lower = model_name.lower()\n",
    "        \n",
    "        if 'xgb' in model_name_lower:\n",
    "            return 'XGBoost'\n",
    "        elif 'light' in model_name_lower or 'lgb' in model_name_lower:\n",
    "            return 'LightGBM'\n",
    "        elif 'random' in model_name_lower:\n",
    "            return 'Scikit-learn'\n",
    "        elif 'logistic' in model_name_lower:\n",
    "            return 'Scikit-learn'\n",
    "        elif 'svm' in model_name_lower:\n",
    "            return 'Scikit-learn'\n",
    "        elif 'decision' in model_name_lower:\n",
    "            return 'Scikit-learn'\n",
    "        elif 'knn' in model_name_lower:\n",
    "            return 'Scikit-learn'\n",
    "        elif 'gradient' in model_name_lower:\n",
    "            return 'Scikit-learn'\n",
    "        else:\n",
    "            return 'Unknown'\n",
    "\n",
    "# Initialize ML module\n",
    "ml_module = EnhancedMLModule()\n",
    "\n",
    "print(\"🤖 Enhanced ML Module Initialized!\")\n",
    "print(f\"🚀 GPU Available: {ml_module.use_gpu}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "load-prepare-data"
   },
   "source": [
    "## 📊 Load and Prepare Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "generate-comprehensive-data"
   },
   "outputs": [],
   "source": [
    "# Generate comprehensive sample data\n",
    "np.random.seed(42)\n",
    "n_samples = 10000\n",
    "\n",
    "# Create realistic e-commerce dataset\n",
    "data = {\n",
    "    'customer_age': np.random.normal(35, 10, n_samples).astype(int),\n",
    "    'annual_income': np.random.lognormal(10.5, 0.8, n_samples),\n",
    "    'credit_score': np.random.normal(650, 100, n_samples).astype(int),\n",
    "    'months_customer': np.random.exponential(24, n_samples).astype(int),\n",
    "    'total_purchases': np.random.poisson(15, n_samples),\n",
    "    'avg_order_value': np.random.gamma(2, 50, n_samples),\n",
    "    'days_since_last_purchase': np.random.exponential(30, n_samples).astype(int),\n",
    "    'website_visits_month': np.random.poisson(8, n_samples),\n",
    "    'mobile_app_usage': np.random.beta(2, 5, n_samples),\n",
    "    'customer_support_contacts': np.random.poisson(2, n_samples),\n",
    "    'product_category_electronics': np.random.binomial(1, 0.3, n_samples),\n",
    "    'product_category_clothing': np.random.binomial(1, 0.4, n_samples),\n",
    "    'product_category_home': np.random.binomial(1, 0.3, n_samples),\n",
    "    'promotions_received': np.random.poisson(5, n_samples),\n",
    "    'social_media_engagement': np.random.beta(1, 3, n_samples)\n",
    "}\n",
    "\n",
    "# Create realistic target variable (customer churn)\n",
    "data['customer_churn'] = (\n",
    "    (data['days_since_last_purchase'] > 90) |\n",
    "    (data['avg_order_value'] < 20) |\n",
    "    (data['website_visits_month'] < 2) |\n",
    "    (np.random.random(n_samples) > 0.85)\n",
    ").astype(int)\n",
    "\n",
    "# Create DataFrame\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Ensure realistic ranges\n",
    "df['customer_age'] = np.clip(df['customer_age'], 18, 80)\n",
    "df['annual_income'] = np.clip(df['annual_income'], 20000, 200000)\n",
    "df['credit_score'] = np.clip(df['credit_score'], 300, 850)\n",
    "\n",
    "print(f\"📊 Dataset created: {df.shape[0]} rows, {df.shape[1]} columns\")\n",
    "print(f\"🎯 Target distribution:\\n{df['customer_churn'].value_counts()}\")\n",
    "print(f\"💰 Income stats: Mean=${df['annual_income'].mean():.0f}, Std=${df['annual_income'].std():.0f}\")\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "quick-eda"
   },
   "outputs": [],
   "source": [
    "# Quick EDA\n",
    "print(\"🔍 Dataset Overview:\")\n",
    "print(f\"- Numeric columns: {len(df.select_dtypes(include=[np.number]).columns)}\")\n",
    "print(f\"- Missing values: {df.isnull().sum().sum()}\")\n",
    "print(f\"- Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB\")\n",
    "\n",
    "# Target distribution visualization\n",
    "fig = px.pie(\n",
    "    values=df['customer_churn'].value_counts().values,\n",
    "    names=['Active', 'Churned'],\n",
    "    title='🎯 Customer Churn Distribution',\n",
    "    color_discrete_sequence=['green', 'red']\n",
    ")\n",
    "fig.show()\n",
    "\n",
    "# Correlation heatmap\n",
    "plt.figure(figsize=(12, 8))\n",
    "corr_matrix = df.corr()\n",
    "sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0)\n",
    "plt.title('📊 Feature Correlation Matrix')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "run-enhanced-automl"
   },
   "source": [
    "## 🚀 Run Enhanced AutoML Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "prepare-features-target"
   },
   "outputs": [],
   "source": [
    "# Prepare features and target\n",
    "feature_cols = [col for col in df.columns if col != 'customer_churn']\n",
    "X = df[feature_cols]\n",
    "y = df['customer_churn']\n",
    "\n",
    "print(f\"📊 Features: {len(feature_cols)} columns\")\n",
    "print(f\"🎯 Target: customer_churn\")\n",
    "print(f\"📈 Dataset size: {X.shape}\")\n",
    "print(f\"🎯 Target distribution: {y.value_counts().to_dict()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "run-automl-pipeline"
   },
   "outputs": [],
   "source": [
    "# Run enhanced AutoML pipeline\n",
    "print(\"🚀 Starting Enhanced AutoML Pipeline...\")\n",
    "print(\"This will train multiple models with advanced feature engineering\\n\")\n",
    "\n",
    "start_time = time.time()\n",
    "\n",
    "ml_results = ml_module.auto_ml_pipeline(\n",
    "    X=X,\n",
    "    y=y,\n",
    "    task_type='classification',\n",
    "    test_size=0.2,\n",
    "    cv_folds=5,\n",
    "    random_state=42,\n",
    "    include_advanced_models=True\n",
    ")\n",
    "\n",
    "training_time = time.time() - start_time\n",
    "\n",
    "print(f\"\\n✅ AutoML Pipeline Completed in {training_time:.2f} seconds\")\n",
    "print(f\"🏆 Best Model: {ml_results['best_model']}\")\n",
    "print(f\"📊 Best Score: {ml_results['test_metrics'].get('accuracy', 0):.4f}\")\n",
    "print(f\"🤖 Models Trained: {len(ml_results['all_models'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "performance-analysis"
   },
   "source": [
    "## 📊 Model Performance Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "detailed-performance-analysis"
   },
   "outputs": [],
   "source": [
    "# Detailed performance analysis\n",
    "print(\"📊 MODEL PERFORMANCE REPORT\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "performance_data = []\n",
    "for model_name, results in ml_results['all_models'].items():\n",
    "    metrics = results['test_metrics']\n",
    "    primary_metric = list(metrics.values())[0]\n",
    "    \n",
    "    performance_data.append({\n",
    "        'Model': model_name,\n",
    "        'Score': f\"{primary_metric:.4f}\",\n",
    "        'Training_Time': f\"{results['training_time']:.2f}s\",\n",
    "        'CV_Score': f\"{results.get('cv_mean', 0):.4f}\",\n",
    "        'Framework': ml_module._get_model_framework(model_name)\n",
    "    })\n",
    "    \n",
    "    print(f\"\\n🔹 {model_name}:\")\n",
    "    print(f\"   Score: {primary_metric:.4f}\")\n",
    "    print(f\"   Time: {results['training_time']:.2f}s\")\n",
    "    print(f\"   CV Score: {results.get('cv_mean', 0):.4f} ± {results.get('cv_std', 0):.4f}\")\n",
    "    print(f\"   Framework: {ml_module._get_model_framework(model_name)}\")\n",
    "\n",
    "# Create performance DataFrame\n",
    "performance_df = pd.DataFrame(performance_data)\n",
    "print(\"\\n\" + \"=\" * 60)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "performance-visualization"
   },
   "outputs": [],
   "source": [
    "# Performance visualization\n",
    "models = list(ml_results['all_models'].keys())\n",
    "scores = [float(ml_results['all_models'][m]['test_metrics'].get('accuracy', 0)) for m in models]\n",
    "times = [ml_results['all_models'][m]['training_time'] for m in models]\n",
    "frameworks = [ml_module._get_model_framework(m) for m in models]\n",
    "\n",
    "# Create subplots\n",
    "fig = make_subplots(\n",
    "    rows=1, cols=2,\n",
    "    subplot_titles=('🎯 Model Accuracy Comparison', '⏱️ Training Time Comparison'),\n",
    "    specs=[[{\"type\": \"bar\"}, {\"type\": \"bar\"}]]\n",
    ")\n",
    "\n",
    "# Accuracy plot\n",
    "fig.add_trace(\n",
    "    go.Bar(x=models, y=scores, name='Accuracy', \n",
    "           marker_color='lightblue',\n",
    "           hovertemplate='<b>%{x}</b><br>Accuracy: %{y:.4f}<extra></extra>'),\n",
    "    row=1, col=1\n",
    ")\n",
    "\n",
    "# Training time plot\n",
    "fig.add_trace(\n",
    "    go.Bar(x=models, y=times, name='Training Time',\n",
    "           marker_color='lightcoral',\n",
    "           hovertemplate='<b>%{x}</b><br>Time: %{y:.2f}s<extra></extra>'),\n",
    "    row=1, col=2\n",
    ")\n",
    "\n",
    "fig.update_layout(\n",
    "    title_text=\"🤖 Model Performance Dashboard\",\n",
    "    height=500,\n",
    "    showlegend=False\n",
    ")\n",
    "\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "performance-tradeoff"
   },
   "outputs": [],
   "source": [
    "# Performance vs Speed scatter plot\n",
    "fig = px.scatter(\n",
    "    x=times, y=scores, text=models,\n",
    "    title='⚡ Accuracy vs Training Time Trade-off',\n",
    "    labels={'x': 'Training Time (seconds)', 'y': 'Accuracy'},\n",
    "    size=[100] * len(models),\n",
    "    color=scores,\n",
    "    color_continuous_scale='viridis'\n",
    ")\n",
    "\n",
    "fig.update_traces(\n",
    "    textposition='top center',\n",
    "    hovertemplate='<b>%{text}</b><br>Accuracy: %{y:.4f}<br>Time: %{x:.2f}s<extra></extra>'\n",
    ")\n",
    "\n",
    "fig.update_layout(\n",
    "    height=500,\n",
    "    xaxis_title='Training Time (seconds)',\n",
    "    yaxis_title='Accuracy Score'\n",
    ")\n",
    "\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "feature-importance"
   },
   "source": [
    "## 🔍 Feature Importance Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "feature-importance-analysis"
   },
   "outputs": [],
   "source": [
    "# Feature importance analysis\n",
    "if 'feature_importance' in ml_results and ml_results['feature_importance']:\n",
    "    importance = ml_results['feature_importance']\n",
    "    \n",
    "    # Get top 15 features\n",
    "    top_features = dict(list(importance.items())[:15])\n",
    "    \n",
    "    # Create feature importance visualization\n",
    "    fig = px.bar(\n",
    "        x=list(top_features.values()),\n",
    "        y=list(top_features.keys()),\n",
    "        orientation='h',\n",
    "        title='🎯 Top 15 Most Important Features',\n",
    "        labels={'x': 'Importance Score', 'y': 'Features'},\n",
    "        color=list(top_features.values()),\n",
    "        color_continuous_scale='viridis'\n",
    "    )\n",
    "    \n",
    "    fig.update_layout(\n",
    "        height=600,\n",
    "        yaxis={'categoryorder': 'total ascending'},\n",
    "        showlegend=False\n",
    "    )\n",
    "    \n",
    "    fig.show()\n",
    "    \n",
    "    # Display top features\n",
    "    print(\"💎 Top 5 Most Important Features:\")\n",
    "    for i, (feature, importance_val) in enumerate(list(importance.items())[:5], 1):\n",
    "        print(f\"{i}. {feature}: {importance_val:.4f}\")\n",
    "        \n",
    "else:\n",
    "    print(\"⚠️ Feature importance not available for this model type\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "model-explainability"
   },
   "source": [
    "## 🧠 Model Explainability with SHAP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "shap-explainability"
   },
   "outputs": [],
   "source": [
    "# Generate comprehensive model explanations\n",
    "print(\"🔍 Generating Model Explanations...\")\n",
    "\n",
    "best_model_name = ml_results['best_model']\n",
    "best_model = ml_results['all_models'][best_model_name]['model']\n",
    "\n",
    "# Use a sample of data for faster computation\n",
    "X_sample = X.sample(n=min(500, len(X)), random_state=42)\n",
    "\n",
    "try:\n",
    "    import shap\n",
    "    \n",
    "    # Create SHAP explainer\n",
    "    explainer = shap.TreeExplainer(best_model)\n",
    "    shap_values = explainer.shap_values(X_sample)\n",
    "    \n",
    "    print(\"✅ SHAP analysis completed successfully!\")\n",
    "    \n",
    "    # SHAP summary plot\n",
    "    plt.figure(figsize=(10, 8))\n",
    "    if isinstance(shap_values, list):\n",
    "        # For multi-class\n",
    "        shap.summary_plot(shap_values[1], X_sample, feature_names=feature_cols, show=False)\n",
    "    else:\n",
    "        # For binary classification\n",
    "        shap.summary_plot(shap_values, X_sample, feature_names=feature_cols, show=False)\n",
    "    \n",
    "    plt.title('🔍 SHAP Feature Importance Summary')\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    \n",
    "    # Generate business insights\n",
    "    print(\"\\n💡 AI-Generated Business Insights:\")\n",
    "    print(\"=\" * 50)\n",
    "    \n",
    "    # Get top features from SHAP\n",
    "    if isinstance(shap_values, list):\n",
    "        shap_importance = np.abs(shap_values[1]).mean(0)\n",
    "    else:\n",
    "        shap_importance = np.abs(shap_values).mean(0)\n",
    "    \n",
    "    top_feature_indices = np.argsort(shap_importance)[-3:][::-1]\n",
    "    top_features = [feature_cols[i] for i in top_feature_indices]\n",
    "    \n",
    "    insights = [\n",
    "        f\"The model's predictions are primarily driven by: {', '.join(top_features)}\",\n",
    "        f\"Focus on data quality and monitoring for '{top_features[0]}' as it has the strongest influence\",\n",
    "        \"Consider feature engineering to create interactions between top features\",\n",
    "        \"Monitor these key features for concept drift in production\",\n",
    "        \"Use these insights to inform business strategy and customer interventions\"\n",
    "    ]\n",
    "    \n",
    "    for i, insight in enumerate(insights, 1):\n",
    "        print(f\"{i}. {insight}\")\n",
    "        \n",
    "except Exception as e:\n",
    "    print(f\"⚠️ Explainability analysis failed: {e}\")\n",
    "    print(\"This is normal for some model types or in notebook environment\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "advanced-analysis"
   },
   "source": [
    "## 📈 Advanced Analysis & Model Diagnostics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "model-diagnostics"
   },
   "outputs": [],
   "source": [
    "# Model diagnostics and advanced analysis\n",
    "print(\"📈 MODEL DIAGNOSTICS AND ADVANCED ANALYSIS\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "# Confusion Matrix for best model\n",
    "best_model_name = ml_results['best_model']\n",
    "best_model = ml_results['all_models'][best_model_name]['model']\n",
    "X_test = ml_results['predictions']['test_actual'].index\n",
    "y_test = ml_results['predictions']['test_actual']\n",
    "y_pred = ml_results['predictions']['test_predictions']\n",
    "\n",
    "# Confusion Matrix\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "plt.figure(figsize=(8, 6))\n",
    "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', \n",
    "            xticklabels=['Not Churn', 'Churn'], \n",
    "            yticklabels=['Not Churn', 'Churn'])\n",
    "plt.title(f'📊 Confusion Matrix - {best_model_name}')\n",
    "plt.ylabel('Actual')\n",
    "plt.xlabel('Predicted')\n",
    "plt.show()\n",
    "\n",
    "# Classification Report\n",
    "print(\"\\n📋 Classification Report:\")\n",
    "print(classification_report(y_test, y_pred, target_names=['Not Churn', 'Churn']))\n",
    "\n",
    "# ROC Curve\n",
    "if hasattr(best_model, 'predict_proba'):\n",
    "    y_proba = best_model.predict_proba(X.loc[X_test])[:, 1]\n",
    "    fpr, tpr, _ = roc_curve(y_test, y_proba)\n",
    "    auc_score = roc_auc_score(y_test, y_proba)\n",
    "    \n",
    "    plt.figure(figsize=(8, 6))\n",
    "    plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.3f})')\n",
    "    plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')\n",
    "    plt.xlabel('False Positive Rate')\n",
    "    plt.ylabel('True Positive Rate')\n",
    "    plt.title('📈 ROC Curve')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "    \n",
    "    print(f\"🎯 AUC Score: {auc_score:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "save-results"
   },
   "source": [
    "## 💾 Save Results & Export Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "save-model-results"
   },
   "outputs": [],
   "source": [
    "# Save model results and artifacts\n",
    "import joblib\n",
    "import json\n",
    "from datetime import datetime\n",
    "\n",
    "# Create results directory\n",
    "import os\n",
    "os.makedirs('model_results', exist_ok=True)\n",
    "\n",
    "# Save the best model\n",
    "best_model_name = ml_results['best_model']\n",
    "best_model_obj = ml_results['all_models'][best_model_name]['model']\n",
    "\n",
    "joblib.dump(best_model_obj, f'model_results/best_model_{best_model_name}.pkl')\n",
    "\n",
    "# Save results summary\n",
    "results_summary = {\n",
    "    'timestamp': datetime.now().isoformat(),\n",
    "    'best_model': best_model_name,\n",
    "    'best_score': ml_results['test_metrics'],\n",
    "    'training_time_seconds': training_time,\n",
    "    'dataset_info': ml_results['data_info'],\n",
    "    'all_models_performance': {\n",
    "        name: {\n",
    "            'score': results['primary_score'],\n",
    "            'training_time': results['training_time'],\n",
    "            'framework': ml_module._get_model_framework(name)\n",
    "        } for name, results in ml_results['all_models'].items()\n",
    "    }\n",
    "}\n",
    "\n",
    "with open('model_results/training_summary.json', 'w') as f:\n",
    "    json.dump(results_summary, f, indent=2)\n",
    "\n",
    "print(\"💾 Results saved successfully!\")\n",
    "print(f\"   - Best model: model_results/best_model_{best_model_name}.pkl\")\n",
    "print(f\"   - Summary: model_results/training_summary.json\")\n",
    "print(f\"   - Training time: {training_time:.2f} seconds\")\n",
    "print(f\"   - Models trained: {len(ml_results['all_models'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "key-insights"
   },
   "source": [
    "## 🎯 Key Insights & Recommendations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "generate-key-insights"
   },
   "outputs": [],
   "source": [
    "# Generate key insights\n",
    "print(\"🎯 KEY INSIGHTS & RECOMMENDATIONS\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "# Model performance insights\n",
    "best_score = float(ml_results['all_models'][best_model_name]['primary_score'])\n",
    "performance_tier = \"Excellent\" if best_score > 0.9 else \"Good\" if best_score > 0.8 else \"Moderate\"\n",
    "\n",
    "print(f\"\\n📊 Performance Summary:\")\n",
    "print(f\"   • Best Model: {best_model_name}\")\n",
    "print(f\"   • Performance: {best_score:.4f} ({performance_tier})\")\n",
    "print(f\"   • Training Efficiency: {training_time:.2f} seconds\")\n",
    "\n",
    "# Framework insights\n",
    "frameworks = [ml_module._get_model_framework(m) for m in ml_results['all_models'].keys()]\n",
    "framework_counts = pd.Series(frameworks).value_counts()\n",
    "\n",
    "print(f\"\\n🤖 Framework Usage:\")\n",
    "for framework, count in framework_counts.items():\n",
    "    print(f\"   • {framework}: {count} models\")\n",
    "\n",
    "# Feature insights\n",
    "if 'feature_importance' in ml_results and ml_results['feature_importance']:\n",
    "    top_feature = list(ml_results['feature_importance'].keys())[0]\n",
    "    print(f\"\\n🔍 Key Driver: '{top_feature}' is the most important feature\")\n",
    "\n",
    "# Business recommendations\n",
    "print(f\"\\n💡 Business Recommendations:\")\n",
    "print(f\"   1. Use {best_model_name} for production deployment\")\n",
    "print(f\"   2. Monitor key features for data quality and concept drift\")\n",
    "print(f\"   3. Implement automated retraining pipeline\")\n",
    "print(f\"   4. Set up model performance monitoring dashboard\")\n",
    "print(f\"   5. Use SHAP explanations for business stakeholder communication\")\n",
    "\n",
    "# Technical recommendations\n",
    "print(f\"\\n🔧 Technical Recommendations:\")\n",
    "print(f\"   1. Consider hyperparameter tuning for further improvement\")\n",
    "print(f\"   2. Explore ensemble methods for better performance\")\n",
    "print(f\"   3. Implement feature store for consistent feature engineering\")\n",
    "print(f\"   4. Set up ML pipeline with version control\")\n",
    "print(f\"   5. Monitor model fairness and bias\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 60)\n",
    "print(\"✅ Model Training Notebook Completed Successfully!\")\n",
    "print(\"🚀 Ready for production deployment and further analysis\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "next-steps"
   },
   "source": [
    "## 📋 Next Steps\n",
    "\n",
    "### Immediate Actions:\n",
    "1. **Deploy Best Model**: Use the saved model for predictions\n",
    "2. **Monitor Performance**: Set up tracking for model drift\n",
    "3. **Feature Monitoring**: Track important feature distributions\n",
    "4. **A/B Testing**: Test model performance in production\n",
    "\n",
    "### Advanced Analysis:\n",
    "1. **Hyperparameter Tuning**: Further optimize model parameters\n",
    "2. **Ensemble Methods**: Combine multiple models for better performance\n",
    "3. **Advanced Feature Engineering**: Create more sophisticated features\n",
    "4. **Time Series Analysis**: If applicable, analyze temporal patterns\n",
    "\n",
    "### Production Readiness:\n",
    "1. **API Development**: Create prediction endpoints\n",
    "2. **Monitoring Dashboard**: Track model performance\n",
    "3. **Automated Retraining**: Set up pipeline for model updates\n",
    "4. **Documentation**: Create model cards and documentation"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": [],
   "gpuType": "T4"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  },
  "accelerator": "GPU"
 },
 "nbformat": 4,
 "nbformat_minor": 0
}