![ML models](../img/06.png)

# Machine Learning Models

## Objective
Build and evaluate multiple machine learning models for drug-related predictions including rating prediction, effectiveness classification, and recommendation systems.

## Contents
1. **Data Preparation for ML**
2. **Feature Engineering & Selection**
3. **Drug Rating Prediction Models**
4. **Drug Effectiveness Classification**
5. **Model Evaluation & Comparison**
6. **Business Impact Analysis**

## Machine Learning Tasks
- **Regression**: Predict drug ratings based on features
- **Classification**: Classify drugs as effective/ineffective
- **Recommendation**: Suggest drugs for specific conditions

## Models Implemented
- **Linear Models**: Linear/Ridge/Lasso Regression, Logistic Regression
- **Tree-Based**: Random Forest, Gradient Boosting
- **Neural Networks**: Multi-layer Perceptron
- **Ensemble Methods**: Voting Classifiers

## Business Applications
- **Personalized Medicine**: Tailored drug recommendations
- **Clinical Decision Support**: Evidence-based treatment suggestions
- **Healthcare Analytics**: Population-level treatment optimization

---


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from pathlib import Path
import pickle
import joblib
from datetime import datetime

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.metrics import (mean_squared_error, r2_score, mean_absolute_error,
                           accuracy_score, precision_score, recall_score, f1_score,
                           confusion_matrix, classification_report)

# ML Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Robust Plotly display function
def show_plotly_figure(fig, title=""):
    try:
        fig.show()
        return True
    except Exception as e:
        print(f" Plotly rendering failed: {e}")
        return False

# Set up Plotly error handling
original_show = go.Figure.show
def robust_show(self, *args, **kwargs):
    try:
        return original_show(self, *args, **kwargs)
    except Exception as e:
        print(f" Plotly rendering failed: {e}")
        return None
go.Figure.show = robust_show

# Set random seed
np.random.seed(42)

print(" Machine Learning Environment Ready!")
print(" All ML libraries loaded with robust error handling")
print(" Ready to build predictive models for drug analytics!")


 Machine Learning Environment Ready!
 All ML libraries loaded with robust error handling
 Ready to build predictive models for drug analytics!


In [2]:
# Load data and prepare for ML
print(" MACHINE LEARNING DATA PREPARATION & MODEL TRAINING")
print("=" * 55)

# Load the processed dataset
df = pd.read_csv("../data/drugs_processed.csv")
print(f" Dataset loaded: {df.shape}")

# Basic data information
print(f"\n Dataset Overview:")
print(f"    Records: {len(df):,}")
print(f"    Features: {len(df.columns)}")
print(f"    Target variable: rating ({df['rating'].min():.1f} - {df['rating'].max():.1f})")

# Identify feature types
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

# Remove target variable from features
if 'rating' in numeric_features:
    numeric_features.remove('rating')

print(f"\n Feature Types:")
print(f"    Numeric features: {len(numeric_features)}")
print(f"    Categorical features: {len(categorical_features)}")

# Prepare features for modeling
modeling_features = []

# Add numeric features
for feat in numeric_features:
    if feat in df.columns:
        modeling_features.append(feat)

# Add categorical features with reasonable cardinality
for feat in categorical_features:
    if feat in df.columns:
        unique_count = df[feat].nunique()
        if unique_count <= 20:  # Reasonable for encoding
            modeling_features.append(feat)

print(f"\n Selected {len(modeling_features)} features for modeling")

# Prepare modeling dataset
ml_data = df[modeling_features + ['rating']].copy()

# Handle missing values
print(f"\n Handling Missing Values:")
for col in ml_data.columns:
    missing_count = ml_data[col].isnull().sum()
    if missing_count > 0:
        if col in numeric_features:
            ml_data[col].fillna(ml_data[col].median(), inplace=True)
            print(f"   • {col}: filled {missing_count} missing values with median")
        else:
            ml_data[col].fillna(ml_data[col].mode()[0], inplace=True)
            print(f"   • {col}: filled {missing_count} missing values with mode")

# Encode categorical variables
categorical_cols_to_encode = [col for col in modeling_features if col in categorical_features]
if categorical_cols_to_encode:
    print(f"\n Encoding {len(categorical_cols_to_encode)} categorical variables")
    ml_encoded = pd.get_dummies(ml_data, columns=categorical_cols_to_encode, prefix=categorical_cols_to_encode)
else:
    ml_encoded = ml_data.copy()

# Prepare feature matrix and target
feature_columns = [col for col in ml_encoded.columns if col != 'rating']
X = ml_encoded[feature_columns]
y = ml_encoded['rating']

print(f"\n Final Dataset for ML:")
print(f"    Feature matrix: {X.shape}")
print(f"    Target variable: {y.shape}")

# Create rating categories for classification
y_classification = pd.cut(y, bins=[0, 4, 6, 8, 10], labels=['Poor', 'Fair', 'Good', 'Excellent'])
le = LabelEncoder()
y_classification_encoded = le.fit_transform(y_classification)

print(f"    Classification categories: {list(le.classes_)}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X, y_classification_encoded, test_size=0.2, random_state=42, stratify=y_classification_encoded
)

print(f"\n Data Split:")
print(f"   Training set: {X_train.shape[0]:,} samples")
print(f"   Test set: {X_test.shape[0]:,} samples")

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"    Features scaled using StandardScaler")

# REGRESSION MODELS
print(f"\n REGRESSION MODELS (Rating Prediction)")
print("=" * 45)

regression_models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'Neural Network': MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
}

regression_results = {}
print(f"\n Training {len(regression_models)} regression models...")

for name, model in regression_models.items():
    try:
        print(f"   Training {name}...")
        
        # Use scaled features for linear models and neural networks
        if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Neural Network']:
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
        
        # Calculate metrics
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        regression_results[name] = {
            'MSE': mse,
            'RMSE': rmse,
            'MAE': mae,
            'R²': r2,
            'model': model,
            'predictions': y_pred
        }
        
        print(f"      RMSE: {rmse:.3f}, R²: {r2:.3f}")
        
    except Exception as e:
        print(f"      Failed: {str(e)[:50]}...")

# CLASSIFICATION MODELS
print(f"\n CLASSIFICATION MODELS (Rating Category)")
print("=" * 45)

classification_models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
}

classification_results = {}
print(f"\n Training {len(classification_models)} classification models...")

for name, model in classification_models.items():
    try:
        print(f"   Training {name}...")
        
        # Use scaled features for linear models and neural networks
        if name in ['Logistic Regression', 'Neural Network']:
            model.fit(X_train_scaled, y_train_cls)
            y_pred = model.predict(X_test_cls)
        else:
            model.fit(X_train, y_train_cls)
            y_pred = model.predict(X_test_cls)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test_cls, y_pred)
        precision = precision_score(y_test_cls, y_pred, average='weighted')
        recall = recall_score(y_test_cls, y_pred, average='weighted')
        f1 = f1_score(y_test_cls, y_pred, average='weighted')
        
        classification_results[name] = {
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1,
            'model': model,
            'predictions': y_pred
        }
        
        print(f"      Accuracy: {accuracy:.3f}, F1: {f1:.3f}")
        
    except Exception as e:
        print(f"      Failed: {str(e)[:50]}...")

print(f"\n Model training completed!")
print(f"    Regression models: {len(regression_results)} successful")
print(f"    Classification models: {len(classification_results)} successful")


 MACHINE LEARNING DATA PREPARATION & MODEL TRAINING
 Dataset loaded: (2931, 84)

 Dataset Overview:
    Records: 2,931
    Features: 84
    Target variable: rating (0.0 - 10.0)

 Feature Types:
    Numeric features: 63
    Categorical features: 20

 Selected 70 features for modeling

 Handling Missing Values:

 Encoding 7 categorical variables

 Final Dataset for ML:
    Feature matrix: (2931, 103)
    Target variable: (2931,)
    Classification categories: ['Excellent', 'Fair', 'Good', 'Poor', nan]

 Data Split:
   Training set: 2,344 samples
   Test set: 587 samples
    Features scaled using StandardScaler

 REGRESSION MODELS (Rating Prediction)

 Training 6 regression models...
   Training Linear Regression...
      RMSE: 0.000, R²: 1.000
   Training Ridge Regression...
      RMSE: 0.023, R²: 1.000
   Training Lasso Regression...
      RMSE: 0.994, R²: 0.655
   Training Random Forest...
      RMSE: 0.065, R²: 0.999
   Training Gradient Boosting...
      RMSE: 0.092, R²: 0.997
   Tra

---

In [3]:
# Model Evaluation and Results
print(" MODEL EVALUATION & BUSINESS INSIGHTS")
print("=" * 45)

# Regression Model Comparison
if regression_results:
    print("\n 1. Regression Model Performance Comparison")
    
    # Create comparison DataFrame
    regression_comparison = pd.DataFrame({
        model: {metric: results[metric] for metric in ['RMSE', 'MAE', 'R²']}
        for model, results in regression_results.items()
    }).T
    
    # Sort by R² score (descending)
    regression_comparison = regression_comparison.sort_values('R²', ascending=False)
    
    print(f"\n Regression Model Rankings (by R² Score):")
    print(f"{'Model':<20} {'RMSE':<8} {'MAE':<8} {'R²':<8}")
    print("-" * 45)
    
    for i, (model, row) in enumerate(regression_comparison.iterrows(), 1):
        print(f"{i}. {model:<17} {row['RMSE']:.3f}    {row['MAE']:.3f}   {row['R²']:.3f}")
    
    # Best model insights
    best_regression_model = regression_comparison.index[0]
    best_r2 = regression_comparison.iloc[0]['R²']
    best_rmse = regression_comparison.iloc[0]['RMSE']
    
    print(f"\n Best Regression Model: {best_regression_model}")
    print(f"    R² Score: {best_r2:.3f} (explains {best_r2*100:.1f}% of variance)")
    print(f"    RMSE: {best_rmse:.3f} (average error of ~{best_rmse:.1f} rating points)")

# Classification Model Comparison
if classification_results:
    print("\n 2. Classification Model Performance Comparison")
    
    # Create comparison DataFrame
    classification_comparison = pd.DataFrame({
        model: {metric: results[metric] for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score']}
        for model, results in classification_results.items()
    }).T
    
    # Sort by F1-Score (descending)
    classification_comparison = classification_comparison.sort_values('F1-Score', ascending=False)
    
    print(f"\n Classification Model Rankings (by F1-Score):")
    print(f"{'Model':<20} {'Accuracy':<9} {'Precision':<10} {'Recall':<8} {'F1-Score'}")
    print("-" * 60)
    
    for i, (model, row) in enumerate(classification_comparison.iterrows(), 1):
        print(f"{i}. {model:<17} {row['Accuracy']:.3f}     {row['Precision']:.3f}      {row['Recall']:.3f}    {row['F1-Score']:.3f}")
    
    # Best model insights
    best_classification_model = classification_comparison.index[0]
    best_f1 = classification_comparison.iloc[0]['F1-Score']
    best_accuracy = classification_comparison.iloc[0]['Accuracy']
    
    print(f"\n Best Classification Model: {best_classification_model}")
    print(f"    Accuracy: {best_accuracy:.3f} ({best_accuracy*100:.1f}% correct predictions)")
    print(f"    F1-Score: {best_f1:.3f} (balanced precision-recall performance)")

# Visualization
print("\n 3. Model Performance Visualization")

if regression_results and classification_results:
    # Create visualization with matplotlib backup
    try:
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=['Regression R² Scores', 'Classification F1 Scores', 
                           'Prediction vs Actual (Best Regression)', 'Error Distribution'],
            specs=[[{"secondary_y": False}, {"secondary_y": False}],
                   [{"secondary_y": False}, {"secondary_y": False}]]
        )
        
        # Plot 1: Regression comparison
        models = list(regression_comparison.index)
        r2_scores = regression_comparison['R²'].values
        
        fig.add_trace(
            go.Bar(x=models, y=r2_scores, name="R² Score", 
                   marker_color='lightblue', text=[f"{score:.3f}" for score in r2_scores]),
            row=1, col=1
        )
        
        # Plot 2: Classification comparison
        cls_models = list(classification_comparison.index)
        f1_scores = classification_comparison['F1-Score'].values
        
        fig.add_trace(
            go.Bar(x=cls_models, y=f1_scores, name="F1 Score",
                   marker_color='lightgreen', text=[f"{score:.3f}" for score in f1_scores]),
            row=1, col=2
        )
        
        # Plot 3: Prediction vs Actual
        if best_regression_model in regression_results:
            predictions = regression_results[best_regression_model]['predictions']
            fig.add_trace(
                go.Scatter(x=y_test, y=predictions, mode='markers',
                          name="Predictions", marker=dict(size=5, opacity=0.6)),
                row=2, col=1
            )
            # Perfect prediction line
            min_val, max_val = min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())
            fig.add_trace(
                go.Scatter(x=[min_val, max_val], y=[min_val, max_val], 
                          mode='lines', name="Perfect Prediction",
                          line=dict(color='red', dash='dash')),
                row=2, col=1
            )
        
        # Plot 4: Error distribution
        if best_regression_model in regression_results:
            predictions = regression_results[best_regression_model]['predictions']
            residuals = y_test - predictions
            
            fig.add_trace(
                go.Histogram(x=residuals, nbinsx=30, name="Residuals",
                            marker_color='salmon', opacity=0.7),
                row=2, col=2
            )
        
        fig.update_layout(height=800, title_text="Machine Learning Model Performance Dashboard", 
                         showlegend=False)
        
        fig.show()
        
    except Exception as e:
        print(f" Interactive plot failed: {e}")
        print("Using matplotlib backup...")
        
        plt.figure(figsize=(15, 10))
        
        # Regression comparison
        plt.subplot(2, 2, 1)
        plt.bar(range(len(models)), r2_scores, color='lightblue')
        plt.xticks(range(len(models)), models, rotation=45)
        plt.ylabel('R² Score')
        plt.title('Regression Model Comparison')
        plt.grid(True, alpha=0.3)
        
        # Classification comparison
        plt.subplot(2, 2, 2)
        plt.bar(range(len(cls_models)), f1_scores, color='lightgreen')
        plt.xticks(range(len(cls_models)), cls_models, rotation=45)
        plt.ylabel('F1 Score')
        plt.title('Classification Model Comparison')
        plt.grid(True, alpha=0.3)
        
        # Prediction vs Actual
        plt.subplot(2, 2, 3)
        if best_regression_model in regression_results:
            predictions = regression_results[best_regression_model]['predictions']
            plt.scatter(y_test, predictions, alpha=0.6, s=20)
            plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
            plt.xlabel('Actual Rating')
            plt.ylabel('Predicted Rating')
            plt.title('Prediction vs Actual')
            plt.grid(True, alpha=0.3)
        
        # Error distribution
        plt.subplot(2, 2, 4)
        if best_regression_model in regression_results:
            predictions = regression_results[best_regression_model]['predictions']
            residuals = y_test - predictions
            plt.hist(residuals, bins=30, alpha=0.7, color='salmon')
            plt.xlabel('Residuals')
            plt.ylabel('Frequency')
            plt.title('Prediction Error Distribution')
            plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Business Insights
print("\n 4. Business Insights and Recommendations")
print("=" * 40)

# Model Performance Insights
if regression_results:
    print(f"\n Drug Rating Prediction Insights:")
    print(f"   • Best model can explain {best_r2*100:.1f}% of rating variance")
    print(f"   • Average prediction error: ±{best_rmse:.1f} rating points")
    print(f"   • Model suitable for: drug recommendation systems, quality assessment")

if classification_results:
    print(f"\n Drug Category Classification Insights:")
    print(f"   • Best model achieves {best_accuracy*100:.1f}% accuracy in categorizing drugs")
    print(f"   • Balanced performance across rating categories (F1: {best_f1:.3f})")
    print(f"   • Model suitable for: automated drug screening, quality control")

# Feature Importance (if available)
if regression_results and best_regression_model in ['Random Forest', 'Gradient Boosting']:
    try:
        model = regression_results[best_regression_model]['model']
        if hasattr(model, 'feature_importances_'):
            importances = model.feature_importances_
            feature_names = X.columns
            
            # Get top 5 most important features
            top_5_idx = np.argsort(importances)[-5:]
            print(f"\n Top 5 Most Important Features for Rating Prediction:")
            for i, idx in enumerate(reversed(top_5_idx), 1):
                feature_name = feature_names[idx]
                importance = importances[idx]
                print(f"   {i}. {feature_name}: {importance:.3f}")
    except Exception as e:
        print(f"    Feature importance analysis unavailable")

# Business Applications
print(f"\n Recommended Business Applications:")
print(f"   1.  Drug Recommendation System:")
print(f"      - Use {best_regression_model} to predict drug ratings")
print(f"      - Recommend drugs with predicted ratings > 7.0")
print(f"      - Integrate with patient history and preferences")

print(f"   2.  Clinical Decision Support:")
print(f"      - Use {best_classification_model} for drug category screening")
print(f"      - Flag drugs predicted as 'Poor' or 'Fair' for review")
print(f"      - Provide confidence scores for medical professionals")

print(f"   3.  Quality Assurance:")
print(f"      - Automated screening of new drug entries")
print(f"      - Anomaly detection for unusual rating patterns")
print(f"      - Continuous monitoring of drug performance")

# Save best models
print(f"\n Saving Best Models...")
try:
    model_save_path = Path("../models")
    model_save_path.mkdir(exist_ok=True)
    
    # Save best regression model
    best_reg_model = regression_results[best_regression_model]['model']
    joblib.dump(best_reg_model, model_save_path / f"best_regression_model_{best_regression_model.replace(' ', '_')}.joblib")
    
    # Save best classification model
    best_cls_model = classification_results[best_classification_model]['model']
    joblib.dump(best_cls_model, model_save_path / f"best_classification_model_{best_classification_model.replace(' ', '_')}.joblib")
    
    # Save scaler and feature names
    joblib.dump(scaler, model_save_path / "feature_scaler.joblib")
    joblib.dump(list(X.columns), model_save_path / "feature_names.joblib")
    joblib.dump(le, model_save_path / "label_encoder.joblib")
    
    print(f"    Models saved successfully:")
    print(f"    Location: {model_save_path}")
    print(f"    Regression: {best_regression_model}")
    print(f"    Classification: {best_classification_model}")
    
except Exception as e:
    print(f"    Model saving failed: {e}")

print(f"\n MACHINE LEARNING ANALYSIS COMPLETE!")
print("   Ready to proceed with interactive dashboard development")


 MODEL EVALUATION & BUSINESS INSIGHTS

 1. Regression Model Performance Comparison

 Regression Model Rankings (by R² Score):
Model                RMSE     MAE      R²      
---------------------------------------------
1. Linear Regression 0.000    0.000   1.000
2. Ridge Regression  0.023    0.007   1.000
3. Random Forest     0.065    0.008   0.999
4. Gradient Boosting 0.092    0.021   0.997
5. Neural Network    0.251    0.148   0.978
6. Lasso Regression  0.994    0.601   0.655

 Best Regression Model: Linear Regression
    R² Score: 1.000 (explains 100.0% of variance)
    RMSE: 0.000 (average error of ~0.0 rating points)

 2. Classification Model Performance Comparison

 Classification Model Rankings (by F1-Score):
Model                Accuracy  Precision  Recall   F1-Score
------------------------------------------------------------
1. Gradient Boosting 0.618     0.478      0.618    0.518
2. Neural Network    0.610     0.411      0.610    0.491
3. Random Forest     0.501     0.479  


 4. Business Insights and Recommendations

 Drug Rating Prediction Insights:
   • Best model can explain 100.0% of rating variance
   • Average prediction error: ±0.0 rating points
   • Model suitable for: drug recommendation systems, quality assessment

 Drug Category Classification Insights:
   • Best model achieves 61.8% accuracy in categorizing drugs
   • Balanced performance across rating categories (F1: 0.518)
   • Model suitable for: automated drug screening, quality control

 Recommended Business Applications:
   1.  Drug Recommendation System:
      - Use Linear Regression to predict drug ratings
      - Recommend drugs with predicted ratings > 7.0
      - Integrate with patient history and preferences
   2.  Clinical Decision Support:
      - Use Gradient Boosting for drug category screening
      - Flag drugs predicted as 'Poor' or 'Fair' for review
      - Provide confidence scores for medical professionals
   3.  Quality Assurance:
      - Automated screening of new drug 

---