# Student Success Analytics & Early Intervention System (SSAES)
## Model Training Pipeline

This notebook implements a complete machine learning pipeline for predicting student performance and identifying at-risk students.

**Dataset Location:** `data/demo/your_dataset.csv`

**Installation:** Run `pip install -r requirements.txt` before executing this notebook.

**Outputs:**
- Trained models saved to `models/`
- Evaluation plots saved to `reports/figures/`

## 2. Setup & Imports

Import all necessary libraries and set up the environment for reproducible results.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import sys

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
from sklearn.metrics import confusion_matrix, classification_report
import xgboost as xgb
import joblib
import shap

# Add src to path for utils
sys.path.append('../src')
from utils import load_data, save_model, load_model, plot_confusion_matrix

# Settings
np.random.seed(42)
plt.style.use('default')
sns.set_palette("husl")
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ All libraries imported successfully!")

## 3. Load Dataset

Load the student performance dataset and perform initial exploration.
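
The cell below checks `df is not None`, which suggests that `load_data` returns `None` when the file cannot be read. The real helper lives in `src/utils.py` and is not shown in this notebook, so the following is only a minimal sketch of how such a loader might behave. The name `load_data_sketch` is hypothetical and is not the project's function.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_data_sketch(path: str) -> Optional[pd.DataFrame]:
    """Hypothetical stand-in for utils.load_data: a DataFrame on success, None on failure."""
    csv_path = Path(path)
    if not csv_path.is_file():
        print(f"File not found: {csv_path}")
        return None
    try:
        return pd.read_csv(csv_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        print(f"Could not parse {csv_path}: {exc}")
        return None
```

Returning `None` instead of raising lets every later cell guard itself with a single `if df is not None:` check.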

In [None]:
# Load dataset
data_path = '../data/demo/your_dataset.csv'
df = load_data(data_path)

if df is not None:
    print("\n📊 Dataset Overview:")
    print(f"Shape: {df.shape}")
    print("\n🔍 First 5 rows:")
    display(df.head())

    print("\n📋 Dataset Info:")
    df.info()

    print("\n📈 Statistical Summary:")
    display(df.describe())

    # Save sample data
    os.makedirs('../reports/figures', exist_ok=True)
    df.head(10).to_csv('../reports/figures/sample_data.csv', index=False)
    print("\n✅ Sample data saved to reports/figures/sample_data.csv")
else:
    print("\n⚠️  Please upload your dataset to data/demo/ folder and update the filename above.")
    print("Expected columns: student_id, final_marks, pass_fail, attendance_rate, etc.")

## 4. Quick Data Quality Report

Analyze data quality issues including missing values, duplicates, and class balance.
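
When the same checks need to be logged or compared across datasets, it can help to gather them into one dictionary. The sketch below does that on made-up data; `quality_summary` and the toy column values are illustrative assumptions, not part of the project.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame, target: str) -> dict:
    """Collect the checks from this section into one dictionary."""
    missing = frame.isnull().sum()
    class_share = frame[target].value_counts(normalize=True)
    return {
        "missing_by_column": missing[missing > 0].to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "minority_class_share": float(class_share.min()),
    }


# Toy data (made up for illustration): one missing value, one duplicated row
toy = pd.DataFrame({
    "attendance_rate": [0.9, 0.8, None, 0.8],
    "pass_fail": [1, 1, 0, 1],
})
summary = quality_summary(toy, target="pass_fail")
print(summary)
```

A small minority-class share is a signal to stratify the split and to look beyond plain accuracy when evaluating the classifier.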

In [None]:
if df is not None:
    print("🔍 Data Quality Assessment\n")
    
    # Missing values analysis
    missing_data = df.isnull().sum()
    missing_percent = (missing_data / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing_data,
        'Missing Percentage': missing_percent
    }).sort_values('Missing Count', ascending=False)
    
    print("📊 Missing Values Summary:")
    display(missing_df[missing_df['Missing Count'] > 0])

    # Duplicates check
    duplicates = df.duplicated().sum()
    print(f"\n🔄 Duplicate rows: {duplicates}")

    # Data types
    print("\n📋 Data Types:")
    print(df.dtypes.value_counts())
    
    # Visualize missing data
    if missing_data.sum() > 0:
        plt.figure(figsize=(12, 6))
        sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
        plt.title('Missing Data Heatmap')
        plt.tight_layout()
        plt.savefig('../reports/figures/missing_data_heatmap.png', dpi=300, bbox_inches='tight')
        plt.show()
    
    # Class balance for pass_fail (if exists)
    if 'pass_fail' in df.columns:
        plt.figure(figsize=(8, 5))
        df['pass_fail'].value_counts().plot(kind='bar')
        plt.title('Class Distribution: Pass/Fail')
        plt.xlabel('Outcome')
        plt.ylabel('Count')
        plt.xticks(rotation=0)
        plt.tight_layout()
        plt.savefig('../reports/figures/class_balance.png', dpi=300, bbox_inches='tight')
        plt.show()
        
        print("\n📊 Class Balance:")
        print(df['pass_fail'].value_counts(normalize=True))

## 5. Preprocessing & Feature Engineering

Clean the data and create features suitable for machine learning models.
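
One caveat about the cell below: medians and modes are computed on the full dataset, before the train/test split in Section 6, so a little information from the test rows leaks into preprocessing. A common alternative, sketched here on made-up data, learns those statistics from the training rows only by wrapping them in a scikit-learn `ColumnTransformer`. The column names and the choice of one-hot encoding are assumptions for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with gaps in both a numeric and a categorical column
toy = pd.DataFrame({
    "attendance_rate": [0.9, 0.8, np.nan, 0.6, 0.7, 0.95, 0.5, np.nan],
    "study_mode": ["day", "evening", "day", np.nan, "day", "evening", "day", "day"],
})
train, test = train_test_split(toy, test_size=0.25, random_state=42)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric, ["attendance_rate"]), ("cat", categorical, ["study_mode"])],
    sparse_threshold=0,  # always return a dense array
)

X_train = preprocess.fit_transform(train)  # statistics learned from training rows only
X_test = preprocess.transform(test)        # test rows reuse those statistics
```

Because the fitted transformer carries its own medians, modes, and categories, the same object can be saved and reused at inference time.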

In [None]:
if df is not None:
    print("🔧 Data Preprocessing & Feature Engineering\n")
    
    # Create a copy for processing
    df_processed = df.copy()
    
    print(f"Original shape: {df_processed.shape}")
    
    # Handle missing values
    numeric_cols = df_processed.select_dtypes(include=[np.number]).columns
    categorical_cols = df_processed.select_dtypes(include=['object']).columns
    
    # Fill numeric missing values with median
    # (assign the result back; chained inplace fillna triggers warnings in newer pandas versions)
    for col in numeric_cols:
        if df_processed[col].isnull().sum() > 0:
            df_processed[col] = df_processed[col].fillna(df_processed[col].median())
            print(f"✅ Filled {col} missing values with median")

    # Fill categorical missing values with mode
    for col in categorical_cols:
        if df_processed[col].isnull().sum() > 0:
            df_processed[col] = df_processed[col].fillna(df_processed[col].mode()[0])
            print(f"✅ Filled {col} missing values with mode")

    # Remove duplicates
    df_processed.drop_duplicates(inplace=True)
    print(f"✅ Removed duplicates. New shape: {df_processed.shape}")
    
    # Feature Engineering Examples
    
    # Create engagement_index if not present
    if 'engagement_index' not in df_processed.columns and 'attendance_rate' in df_processed.columns:
        # Placeholder engagement index: attendance plus random noise (synthetic, not a measured signal).
        # The clip to [0, 1] assumes attendance_rate is on a 0-1 scale; rescale first if it is a percentage.
        df_processed['engagement_index'] = df_processed['attendance_rate'] * 0.7 + np.random.normal(0.3, 0.1, len(df_processed))
        df_processed['engagement_index'] = np.clip(df_processed['engagement_index'], 0, 1)
        print("✅ Created engagement_index feature")

    # Create attendance trend (if multiple attendance columns exist)
    # Note: this is the row-wise mean across attendance columns, not a slope over time
    attendance_cols = [col for col in df_processed.columns if 'attendance' in col.lower()]
    if len(attendance_cols) > 1:
        df_processed['attendance_trend'] = df_processed[attendance_cols].mean(axis=1)
        print("✅ Created attendance_trend feature")
    
    # Encode categorical variables
    label_encoders = {}
    categorical_cols = df_processed.select_dtypes(include=['object']).columns
    
    for col in categorical_cols:
        if col not in ['student_id']:  # Don't encode ID columns
            le = LabelEncoder()
            df_processed[col] = le.fit_transform(df_processed[col])
            label_encoders[col] = le
            print(f"✅ Label encoded {col}")

    print(f"\n📊 Processed dataset shape: {df_processed.shape}")
    print("\n📋 Final data types:")
    print(df_processed.dtypes.value_counts())

    # Display processed data sample
    print("\n🔍 Processed data sample:")
    display(df_processed.head())

## 6. Train/Test Split

Split the data into training and testing sets with stratification for classification tasks.
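
To see what stratification buys, the sketch below splits made-up labels with a 20% fail rate and checks that both splits keep that share. The data is synthetic, and treating class 0 as "fail" is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 20 of 100 students fail (class 0)
y = np.array([0] * 20 + [1] * 80)
X = np.arange(100).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both splits keep the 20% fail share
print("train fail share:", (y_tr == 0).mean())
print("test fail share:", (y_te == 0).mean())
```

Without `stratify`, a small test set can end up with very few at-risk students, which makes recall estimates for that class unstable.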

In [None]:
if df is not None:
    print("🔄 Creating Train/Test Split\n")
    
    # Define features and targets
    feature_cols = [col for col in df_processed.columns if col not in ['student_id', 'final_marks', 'pass_fail']]
    X = df_processed[feature_cols]
    
    # Regression target
    if 'final_marks' in df_processed.columns:
        y_reg = df_processed['final_marks']
        print(f"✅ Regression target: final_marks (range: {y_reg.min():.1f} - {y_reg.max():.1f})")

    # Classification target
    if 'pass_fail' in df_processed.columns:
        y_class = df_processed['pass_fail']
        print(f"✅ Classification target: pass_fail (classes: {y_class.unique()})")

    print(f"\n📊 Features: {len(feature_cols)} columns")
    print(f"Feature names: {feature_cols}")
    
    # Split for regression
    if 'final_marks' in df_processed.columns:
        X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
            X, y_reg, test_size=0.2, random_state=42
        )
        print(f"\n📈 Regression split - Train: {X_train_reg.shape[0]}, Test: {X_test_reg.shape[0]}")
    
    # Split for classification (with stratification)
    if 'pass_fail' in df_processed.columns:
        X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
            X, y_class, test_size=0.2, random_state=42, stratify=y_class
        )
        print(f"📊 Classification split - Train: {X_train_class.shape[0]}, Test: {X_test_class.shape[0]}")
    
    # Scale features
    # Note: the same scaler object is fit twice, so after this cell it holds the
    # statistics of the classification training split
    scaler = StandardScaler()

    if 'final_marks' in df_processed.columns:
        X_train_reg_scaled = scaler.fit_transform(X_train_reg)
        X_test_reg_scaled = scaler.transform(X_test_reg)
        print("✅ Features scaled for regression")

    if 'pass_fail' in df_processed.columns:
        X_train_class_scaled = scaler.fit_transform(X_train_class)
        X_test_class_scaled = scaler.transform(X_test_class)
        print("✅ Features scaled for classification")

    # Save splits
    # Reset the index on both features and labels so concat pairs them row by row
    # instead of aligning a shuffled index against a fresh 0..n-1 index
    if 'pass_fail' in df_processed.columns:
        train_df = pd.concat([X_train_class.reset_index(drop=True), y_train_class.reset_index(drop=True)], axis=1)
        test_df = pd.concat([X_test_class.reset_index(drop=True), y_test_class.reset_index(drop=True)], axis=1)

        train_df.to_csv('../data/demo/train.csv', index=False)
        test_df.to_csv('../data/demo/test.csv', index=False)
        print("✅ Train/test splits saved to data/demo/")

## 7. Baseline Models (Regression & Classification)

Train simple baseline models to establish performance benchmarks.
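
The cell below reports weighted-average precision, recall, and F1. For an early-intervention use case it is also worth looking at recall on the at-risk class by itself, since a weighted average can look healthy while every failing student is missed. The sketch below shows this on made-up labels; coding "fail" as class 0 is an assumption and should be checked against the actual label encoding.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up outcome: 8 students pass (1), 2 fail (0); the "model" predicts pass for everyone
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
fail_recall = recall_score(y_true, y_pred, pos_label=0)  # recall for the at-risk class only

print(f"accuracy={accuracy:.2f}, weighted recall={weighted_recall:.2f}, fail recall={fail_recall:.2f}")
```

Here accuracy and weighted recall are both 0.80 while recall on the fail class is 0.00, which is why a per-class view is a useful companion to the weighted metrics.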

In [None]:
if df is not None:
    print("🎯 Training Baseline Models\n")

    # Regression Baseline: Linear Regression
    if 'final_marks' in df_processed.columns:
        print("📈 Regression Baseline: Linear Regression")
        
        lr_reg = LinearRegression()
        lr_reg.fit(X_train_reg_scaled, y_train_reg)
        
        # Predictions
        y_pred_reg = lr_reg.predict(X_test_reg_scaled)
        
        # Metrics
        rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
        mae = mean_absolute_error(y_test_reg, y_pred_reg)
        r2 = r2_score(y_test_reg, y_pred_reg)
        
        print(f"  RMSE: {rmse:.3f}")
        print(f"  MAE: {mae:.3f}")
        print(f"  R²: {r2:.3f}")
        
        # Plot residuals
        plt.figure(figsize=(10, 4))
        
        plt.subplot(1, 2, 1)
        plt.scatter(y_test_reg, y_pred_reg, alpha=0.6)
        plt.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'r--')
        plt.xlabel('Actual')
        plt.ylabel('Predicted')
        plt.title('Actual vs Predicted')
        
        plt.subplot(1, 2, 2)
        residuals = y_test_reg - y_pred_reg
        plt.scatter(y_pred_reg, residuals, alpha=0.6)
        plt.axhline(y=0, color='r', linestyle='--')
        plt.xlabel('Predicted')
        plt.ylabel('Residuals')
        plt.title('Residual Plot')
        
        plt.tight_layout()
        plt.savefig('../reports/figures/regression_baseline.png', dpi=300, bbox_inches='tight')
        plt.show()
    
    # Classification Baseline: Logistic Regression
    if 'pass_fail' in df_processed.columns:
        print("\n📊 Classification Baseline: Logistic Regression")
        
        lr_class = LogisticRegression(random_state=42, max_iter=1000)
        lr_class.fit(X_train_class_scaled, y_train_class)
        
        # Predictions
        y_pred_class = lr_class.predict(X_test_class_scaled)
        y_pred_proba = lr_class.predict_proba(X_test_class_scaled)[:, 1]
        
        # Metrics
        accuracy = accuracy_score(y_test_class, y_pred_class)
        precision = precision_score(y_test_class, y_pred_class, average='weighted')
        recall = recall_score(y_test_class, y_pred_class, average='weighted')
        f1 = f1_score(y_test_class, y_pred_class, average='weighted')
        
        print(f"  Accuracy: {accuracy:.3f}")
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall: {recall:.3f}")
        print(f"  F1-Score: {f1:.3f}")
        
        # Confusion Matrix
        plot_confusion_matrix(y_test_class, y_pred_class, 
                            save_path='../reports/figures/confusion_matrix_baseline.png',
                            title='Baseline Logistic Regression')
        
        print("\n📋 Classification Report:")
        print(classification_report(y_test_class, y_pred_class))

## 8. Advanced Models & Hyperparameter Tuning

Train Random Forest and XGBoost models with hyperparameter optimization.
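
The regression searches below use `scoring='neg_mean_squared_error'`, which means `best_score_` is a negated MSE. The sketch below, on synthetic data with a deliberately small grid, shows how to turn that score back into an RMSE. The dataset and grid values are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to keep the example self-contained
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, None], "min_samples_split": [2, 5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# best_score_ is a negated MSE (higher is better), so flip the sign before taking the root
cv_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_, f"CV RMSE ~ {cv_rmse:.2f}")
```

Comparing this cross-validated RMSE with the test-set RMSE gives a rough check on whether the tuned model overfits the training folds.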

In [None]:
if df is not None:
    print("🚀 Advanced Models & Hyperparameter Tuning\n")

    # Store results for comparison
    regression_results = {}
    classification_results = {}

    # Add baseline results
    if 'final_marks' in df_processed.columns:
        regression_results['Linear Regression'] = {
            'RMSE': rmse, 'MAE': mae, 'R²': r2
        }

    if 'pass_fail' in df_processed.columns:
        classification_results['Logistic Regression'] = {
            'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1': f1
        }

    # Random Forest Regression
    if 'final_marks' in df_processed.columns:
        print("🌲 Random Forest Regression")

        rf_reg_params = {
            'n_estimators': [50, 100],
            'max_depth': [5, 10, None],
            'min_samples_split': [2, 5]
        }

        rf_reg = RandomForestRegressor(random_state=42)
        rf_reg_grid = GridSearchCV(rf_reg, rf_reg_params, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
        rf_reg_grid.fit(X_train_reg, y_train_reg)

        print(f"  Best params: {rf_reg_grid.best_params_}")

        # Evaluate
        y_pred_rf_reg = rf_reg_grid.predict(X_test_reg)
        rmse_rf = np.sqrt(mean_squared_error(y_test_reg, y_pred_rf_reg))
        mae_rf = mean_absolute_error(y_test_reg, y_pred_rf_reg)
        r2_rf = r2_score(y_test_reg, y_pred_rf_reg)

        regression_results['Random Forest'] = {
            'RMSE': rmse_rf, 'MAE': mae_rf, 'R²': r2_rf
        }

        print(f"  RMSE: {rmse_rf:.3f}, MAE: {mae_rf:.3f}, R²: {r2_rf:.3f}")

    # Random Forest Classification
    if 'pass_fail' in df_processed.columns:
        print("\n🌲 Random Forest Classification")

        rf_class_params = {
            'n_estimators': [50, 100],
            'max_depth': [5, 10, None],
            'min_samples_split': [2, 5]
        }

        rf_class = RandomForestClassifier(random_state=42)
        rf_class_grid = GridSearchCV(rf_class, rf_class_params, cv=3, scoring='accuracy', n_jobs=-1)
        rf_class_grid.fit(X_train_class, y_train_class)

        print(f"  Best params: {rf_class_grid.best_params_}")

        # Evaluate
        y_pred_rf_class = rf_class_grid.predict(X_test_class)
        acc_rf = accuracy_score(y_test_class, y_pred_rf_class)
        prec_rf = precision_score(y_test_class, y_pred_rf_class, average='weighted')
        rec_rf = recall_score(y_test_class, y_pred_rf_class, average='weighted')
        f1_rf = f1_score(y_test_class, y_pred_rf_class, average='weighted')

        classification_results['Random Forest'] = {
            'Accuracy': acc_rf, 'Precision': prec_rf, 'Recall': rec_rf, 'F1': f1_rf
        }

        print(f"  Accuracy: {acc_rf:.3f}, Precision: {prec_rf:.3f}, Recall: {rec_rf:.3f}, F1: {f1_rf:.3f}")

    # XGBoost Regression
    if 'final_marks' in df_processed.columns:
        print("\n⚡ XGBoost Regression")

        xgb_reg_params = {
            'n_estimators': [50, 100],
            'max_depth': [3, 6],
            'learning_rate': [0.1, 0.2]
        }

        xgb_reg = xgb.XGBRegressor(random_state=42)
        xgb_reg_grid = GridSearchCV(xgb_reg, xgb_reg_params, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
        xgb_reg_grid.fit(X_train_reg, y_train_reg)

        print(f"  Best params: {xgb_reg_grid.best_params_}")

        # Evaluate
        y_pred_xgb_reg = xgb_reg_grid.predict(X_test_reg)
        rmse_xgb = np.sqrt(mean_squared_error(y_test_reg, y_pred_xgb_reg))
        mae_xgb = mean_absolute_error(y_test_reg, y_pred_xgb_reg)
        r2_xgb = r2_score(y_test_reg, y_pred_xgb_reg)

        regression_results['XGBoost'] = {
            'RMSE': rmse_xgb, 'MAE': mae_xgb, 'R²': r2_xgb
        }

        print(f"  RMSE: {rmse_xgb:.3f}, MAE: {mae_xgb:.3f}, R²: {r2_xgb:.3f}")

    # XGBoost Classification
    if 'pass_fail' in df_processed.columns:
        print("\n⚡ XGBoost Classification")

        xgb_class_params = {
            'n_estimators': [50, 100],
            'max_depth': [3, 6],
            'learning_rate': [0.1, 0.2]
        }

        xgb_class = xgb.XGBClassifier(random_state=42)
        xgb_class_grid = GridSearchCV(xgb_class, xgb_class_params, cv=3, scoring='accuracy', n_jobs=-1)
        xgb_class_grid.fit(X_train_class, y_train_class)

        print(f"  Best params: {xgb_class_grid.best_params_}")

        # Evaluate
        y_pred_xgb_class = xgb_class_grid.predict(X_test_class)
        acc_xgb = accuracy_score(y_test_class, y_pred_xgb_class)
        prec_xgb = precision_score(y_test_class, y_pred_xgb_class, average='weighted')
        rec_xgb = recall_score(y_test_class, y_pred_xgb_class, average='weighted')
        f1_xgb = f1_score(y_test_class, y_pred_xgb_class, average='weighted')

        classification_results['XGBoost'] = {
            'Accuracy': acc_xgb, 'Precision': prec_xgb, 'Recall': rec_xgb, 'F1': f1_xgb
        }

        print(f"  Accuracy: {acc_xgb:.3f}, Precision: {prec_xgb:.3f}, Recall: {rec_xgb:.3f}, F1: {f1_xgb:.3f}")

## 9. Model Evaluation & Comparison

Compare all models and create comprehensive evaluation reports.

In [None]:
if df is not None:
    print("📊 Model Evaluation & Comparison\n")

    # Regression Results Table
    if 'final_marks' in df_processed.columns and regression_results:
        print("📈 Regression Models Comparison:")
        reg_comparison = pd.DataFrame(regression_results).T
        reg_comparison = reg_comparison.round(3)
        display(reg_comparison)

        # Save comparison
        reg_comparison.to_csv('../reports/figures/regression_model_comparison.csv')
        print("✅ Regression comparison saved")

        # Find best model
        best_reg_model = reg_comparison['R²'].idxmax()
        print(f"\n🏆 Best Regression Model: {best_reg_model} (R² = {reg_comparison.loc[best_reg_model, 'R²']:.3f})")
    
    # Classification Results Table
    if 'pass_fail' in df_processed.columns and classification_results:
        print("\n📊 Classification Models Comparison:")
        class_comparison = pd.DataFrame(classification_results).T
        class_comparison = class_comparison.round(3)
        display(class_comparison)

        # Save comparison
        class_comparison.to_csv('../reports/figures/classification_model_comparison.csv')
        print("✅ Classification comparison saved")

        # Find best model
        best_class_model = class_comparison['F1'].idxmax()
        print(f"\n🏆 Best Classification Model: {best_class_model} (F1 = {class_comparison.loc[best_class_model, 'F1']:.3f})")
        
        # ROC Curve for classification
        if 'pass_fail' in df_processed.columns:
            plt.figure(figsize=(8, 6))
            
            # Plot ROC for each model
            models_to_plot = [
                ('Logistic Regression', lr_class, X_test_class_scaled),
                ('Random Forest', rf_class_grid.best_estimator_, X_test_class),
                ('XGBoost', xgb_class_grid.best_estimator_, X_test_class)
            ]
            
            for name, model, X_test_data in models_to_plot:
                try:
                    y_proba = model.predict_proba(X_test_data)[:, 1]
                    fpr, tpr, _ = roc_curve(y_test_class, y_proba)
                    auc_score = roc_auc_score(y_test_class, y_proba)
                    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')
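                except Exception as e:
                    # Report why a curve is missing instead of silently skipping it
                    print(f"  Skipping ROC curve for {name}: {e}")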
            
            plt.plot([0, 1], [0, 1], 'k--', label='Random')
            plt.xlabel('False Positive Rate')
            plt.ylabel('True Positive Rate')
            plt.title('ROC Curves Comparison')
            plt.legend()
            plt.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.savefig('../reports/figures/roc_curves_comparison.png', dpi=300, bbox_inches='tight')
            plt.show()
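
The AUC values in the ROC legend can be read as the probability that a randomly chosen passing student receives a higher score than a randomly chosen failing one. The sketch below works this through on four made-up students, where three of the four pass/fail pairs are ranked correctly.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Four made-up students: true labels and predicted pass probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

The only misranked pair is the failing student scored 0.4 against the passing student scored 0.35, so the AUC is 3/4 = 0.75.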

## 10. Model Persistence

Save the best performing models for future use.
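
The cell below saves the model and the scaler as separate files, so whoever loads them must remember which models expect scaled input. One alternative, sketched here on synthetic data and not the approach this notebook takes, is to bundle both steps in a scikit-learn pipeline and persist that single object with `joblib`. The file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the student features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Scaler and classifier travel together, so inference cannot skip or mismatch the scaling step
bundle = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
bundle.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier_bundle.pkl")  # hypothetical file name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(bundle.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```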

In [None]:
if df is not None:
    print("💾 Saving Best Models\n")
    
    # Save best regression model
    if 'final_marks' in df_processed.columns:
        if best_reg_model == 'Random Forest':
            best_regressor = rf_reg_grid.best_estimator_
        elif best_reg_model == 'XGBoost':
            best_regressor = xgb_reg_grid.best_estimator_
        else:
            best_regressor = lr_reg
        
        save_model(best_regressor, '../models/best_regressor.pkl')
        
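        # Also save a scaler fitted on the regression training split.
        # The shared `scaler` object was last fitted on the classification split,
        # so fit a fresh one here rather than saving that one under the regression name.
        save_model(StandardScaler().fit(X_train_reg), '../models/scaler_regression.pkl')

        print(f"✅ Best regressor ({best_reg_model}) saved")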
    
    # Save best classification model
    if 'pass_fail' in df_processed.columns:
        if best_class_model == 'Random Forest':
            best_classifier = rf_class_grid.best_estimator_
        elif best_class_model == 'XGBoost':
            best_classifier = xgb_class_grid.best_estimator_
        else:
            best_classifier = lr_class
        
        save_model(best_classifier, '../models/best_classifier.pkl')
        
        # Also save the scaler
        save_model(scaler, '../models/scaler_classification.pkl')
        
        print(f"✅ Best classifier ({best_class_model}) saved")

    # Save label encoders
    if label_encoders:
        save_model(label_encoders, '../models/label_encoders.pkl')
        print("✅ Label encoders saved")

    # Save feature names
    save_model(feature_cols, '../models/feature_names.pkl')
    print("✅ Feature names saved")

## 11. Explainability (SHAP)

Use SHAP to understand model predictions and feature importance.
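
If SHAP is unavailable or fails for a given model, permutation importance is a model-agnostic fallback: shuffle one feature at a time and measure how much the score drops. This is a different technique from SHAP and gives only global importance, not per-student explanations. The sketch below uses synthetic data in which only the first feature matters.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first feature drives the label, the second is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

for name, score in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: mean accuracy drop = {score:.3f}")
```

In practice the importance should be computed on held-out data so it reflects what the model relies on when generalizing.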

In [None]:
if df is not None:
    print("🔍 Model Explainability with SHAP\n")

    # SHAP for classification model
    if 'pass_fail' in df_processed.columns:
        print("📊 SHAP Analysis for Classification Model")
        
        try:
            # Create SHAP explainer
            if best_class_model in ['Random Forest', 'XGBoost']:
                explainer = shap.TreeExplainer(best_classifier)
                shap_values = explainer.shap_values(X_test_class[:100])  # Use first 100 samples
                
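                # For binary classification, keep the positive-class (class 1) SHAP values.
                # Depending on the shap version, tree explainers may return a list with one
                # array per class or a single array shaped (samples, features, classes).
                if isinstance(shap_values, list):
                    shap_values = shap_values[1]
                elif getattr(shap_values, 'ndim', 2) == 3:
                    shap_values = shap_values[:, :, 1]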
                
            else:  # Logistic Regression
                explainer = shap.LinearExplainer(best_classifier, X_train_class_scaled)
                shap_values = explainer.shap_values(X_test_class_scaled[:100])
            
            # Summary plot
            plt.figure(figsize=(10, 6))
            shap.summary_plot(shap_values, X_test_class[:100], feature_names=feature_cols, show=False)
            plt.tight_layout()
            plt.savefig('../reports/figures/shap_summary_classification.png', dpi=300, bbox_inches='tight')
            plt.show()
            
            # Feature importance plot
            plt.figure(figsize=(10, 6))
            shap.summary_plot(shap_values, X_test_class[:100], feature_names=feature_cols, plot_type="bar", show=False)
            plt.tight_layout()
            plt.savefig('../reports/figures/shap_importance_classification.png', dpi=300, bbox_inches='tight')
            plt.show()
            
            print("✅ SHAP plots saved for classification model")

        except Exception as e:
            print(f"⚠️  SHAP analysis failed: {str(e)}")
            print("Showing feature importance from tree model instead:")
            
            if hasattr(best_classifier, 'feature_importances_'):
                importance_df = pd.DataFrame({
                    'feature': feature_cols,
                    'importance': best_classifier.feature_importances_
                }).sort_values('importance', ascending=False)
                
                plt.figure(figsize=(10, 6))
                sns.barplot(data=importance_df.head(10), x='importance', y='feature')
                plt.title('Top 10 Feature Importances')
                plt.xlabel('Importance')
                plt.tight_layout()
                plt.savefig('../reports/figures/feature_importance_classification.png', dpi=300, bbox_inches='tight')
                plt.show()
    
    # SHAP for regression model
    # (the regressor is always a separately fitted model, so explain it even when
    # both tasks selected the same algorithm family)
    if 'final_marks' in df_processed.columns:
        print("\n📈 SHAP Analysis for Regression Model")
        
        try:
            if best_reg_model in ['Random Forest', 'XGBoost']:
                explainer_reg = shap.TreeExplainer(best_regressor)
                shap_values_reg = explainer_reg.shap_values(X_test_reg[:100])
            else:
                explainer_reg = shap.LinearExplainer(best_regressor, X_train_reg_scaled)
                shap_values_reg = explainer_reg.shap_values(X_test_reg_scaled[:100])
            
            plt.figure(figsize=(10, 6))
            shap.summary_plot(shap_values_reg, X_test_reg[:100], feature_names=feature_cols, show=False)
            plt.tight_layout()
            plt.savefig('../reports/figures/shap_summary_regression.png', dpi=300, bbox_inches='tight')
            plt.show()
            
            print("✅ SHAP plots saved for regression model")

        except Exception as e:
            print(f"⚠️  SHAP analysis failed for regression: {str(e)}")

## 12. Quick Inference Example

Demonstrate how to load and use the trained models for predictions.
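
The risk tiers used in the cell below can be factored into a small function so the same thresholds (0.7 and 0.4 on the risk score) apply wherever predictions are consumed. The function name `risk_level` is an illustrative choice, not an existing project helper.

```python
def risk_level(pass_probability: float) -> str:
    """Map a predicted pass probability to the tiers used in this notebook."""
    risk_score = 1 - pass_probability  # higher risk = lower pass probability
    if risk_score > 0.7:
        return "HIGH"
    if risk_score > 0.4:
        return "MEDIUM"
    return "LOW"


for p in (0.2, 0.5, 0.9):
    print(p, risk_level(p))
```

Keeping the thresholds in one place makes them easy to revisit once educators have feedback on how many alerts each tier produces.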

In [None]:
if df is not None:
    print("🔮 Model Inference Example\n")

    # Load saved models
    print("📥 Loading saved models...")
    
    if 'pass_fail' in df_processed.columns:
        loaded_classifier = load_model('../models/best_classifier.pkl')
        loaded_scaler = load_model('../models/scaler_classification.pkl')
        loaded_encoders = load_model('../models/label_encoders.pkl')
        loaded_features = load_model('../models/feature_names.pkl')
        
        if all(obj is not None for obj in (loaded_classifier, loaded_scaler, loaded_features)):
            print("\n🎯 Classification Inference Example:")

            # Take a sample from test set
            sample_idx = 0
            sample_data = X_test_class.iloc[sample_idx:sample_idx+1]
            actual_label = y_test_class.iloc[sample_idx]

            print("Sample student data:")
            for col, val in sample_data.iloc[0].items():
                print(f"  {col}: {val}")

            # Make prediction
            # Only Logistic Regression was trained on scaled features; the tree models saw raw values
            if best_class_model == 'Logistic Regression':
                sample_input = loaded_scaler.transform(sample_data)
            else:
                sample_input = sample_data
            prediction = loaded_classifier.predict(sample_input)[0]
            prediction_proba = loaded_classifier.predict_proba(sample_input)[0]

            print("\n📊 Prediction Results:")
            print(f"  Actual: {actual_label}")
            print(f"  Predicted: {prediction}")
            print(f"  Confidence: {max(prediction_proba):.3f}")
            # predict_proba columns follow classifier.classes_; this assumes class 0 = fail, class 1 = pass
            print(f"  Probabilities: Fail={prediction_proba[0]:.3f}, Pass={prediction_proba[1]:.3f}")

            # Risk assessment
            risk_score = 1 - prediction_proba[1]  # Higher risk = lower pass probability
            if risk_score > 0.7:
                risk_level = "🔴 HIGH RISK"
            elif risk_score > 0.4:
                risk_level = "🟡 MEDIUM RISK"
            else:
                risk_level = "🟢 LOW RISK"

            print(f"\n⚠️  Risk Assessment: {risk_level} (Score: {risk_score:.3f})")
    
    # Regression inference example
    if 'final_marks' in df_processed.columns:
        loaded_regressor = load_model('../models/best_regressor.pkl')
        
        if loaded_regressor:
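            print("\n📈 Regression Inference Example:")

            sample_data_reg = X_test_reg.iloc[0:1]
            actual_marks = y_test_reg.iloc[0]

            # Make prediction
            if best_reg_model == 'Linear Regression':
                # Load the persisted regression scaler so inference mirrors a deployed service
                loaded_scaler_reg = load_model('../models/scaler_regression.pkl')
                sample_scaled_reg = loaded_scaler_reg.transform(sample_data_reg)
                predicted_marks = loaded_regressor.predict(sample_scaled_reg)[0]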
            else:
                predicted_marks = loaded_regressor.predict(sample_data_reg)[0]
            
            print(f"  Actual marks: {actual_marks:.1f}")
            print(f"  Predicted marks: {predicted_marks:.1f}")
            print(f"  Prediction error: {abs(actual_marks - predicted_marks):.1f}")
    
    print("\n✅ Inference examples completed!")

## 13. Conclusions & Next Steps

### Summary of Results

This notebook has successfully implemented a complete machine learning pipeline for student success prediction:

**Key Achievements:**
- ✅ Data preprocessing and feature engineering
- ✅ Multiple model training and comparison
- ✅ Hyperparameter optimization
- ✅ Model evaluation and selection
- ✅ Model explainability with SHAP
- ✅ Model persistence for deployment

**Best Models:**
- **Classification:** Identifies at-risk students for early intervention
- **Regression:** Predicts final marks for academic planning

### Next Steps for Production Deployment

1. **Django Integration:**
   - Create Django views to load models and make predictions
   - Build REST API endpoints for real-time predictions
   - Implement batch prediction for multiple students

2. **Alert System:**
   - Set up automated alerts for high-risk students
   - Create dashboard for educators and administrators
   - Implement email/SMS notifications

3. **Enhanced Features:**
   - Add more sophisticated feature engineering
   - Implement time-series analysis for trend detection
   - Include external factors (socioeconomic, health, etc.)

4. **Model Monitoring:**
   - Set up model performance monitoring
   - Implement automated retraining pipelines
   - Track prediction accuracy over time

5. **Integration:**
   - Connect with Student Information Systems (SIS)
   - Integrate with Learning Management Systems (LMS)
   - Add real-time data feeds

### Files Generated

- **Models:** `models/best_classifier.pkl`, `models/best_regressor.pkl`
- **Preprocessors:** `models/scaler_*.pkl`, `models/label_encoders.pkl`
- **Evaluations:** `reports/figures/regression_model_comparison.csv`, `reports/figures/classification_model_comparison.csv`
- **Visualizations:** Various plots in `reports/figures/`

The trained models are now ready for integration into the SSAES web application!