## **Model selection and comparison**

**Model selection and comparison** are critical steps in the machine learning process: they determine which model you use for a given dataset and problem. These steps involve choosing candidate models, comparing their performance, and tuning them to improve accuracy, generalization, and other evaluation metrics. The sections below explain the main concepts:

### 1. **Why Is Model Selection Important?**

Model selection is important because different algorithms perform better or worse depending on the nature of the data and the problem at hand. No single algorithm works best for all problems (this is referred to as the "no free lunch" theorem). Some models generalize better, while others are more robust to outliers or handle missing data differently. Therefore, selecting the right model is key to building a successful machine learning solution.

### 2. **Steps in Model Selection:**

- **Step 1: Define the Problem**: 
  First, identify whether it's a classification, regression, clustering, or another type of problem.
  
- **Step 2: Preprocess the Data**: 
  Clean the dataset, handle missing values, encode categorical variables, and normalize the data if required. This step ensures that your models receive clean, consistent input data.

- **Step 3: Choose a Range of Models**:
  Based on the problem type and data, select a range of models that you want to test. For example, for a classification problem, you may want to try:
  - Logistic Regression
  - Decision Trees
  - Random Forest
  - k-Nearest Neighbors (k-NN)
  - Support Vector Machines (SVM)
  - Gradient Boosting methods (like XGBoost)
  
- **Step 4: Use Cross-Validation for Evaluation**:
  Split your data into training and test sets, and keep the test set aside for the final evaluation. To compare models on the training set, a common technique is **k-fold cross-validation**: the training data is split into `k` subsets (folds), and the model is trained on `k-1` folds and validated on the remaining one. This is repeated `k` times so that each fold serves once as the validation set, and the average performance across all folds is used to compare models.
  
- **Step 5: Evaluate Models Using Performance Metrics**:
  Use relevant performance metrics for comparison:
  - For classification: Accuracy, Precision, Recall, F1-score, ROC-AUC
  - For regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared
  
- **Step 6: Compare Models**:
  Based on cross-validation results and performance metrics, compare how the different models perform. Select the model that generalizes best to unseen data without being more complex than the problem requires, since an overly complex model is more likely to overfit.

- **Step 7: Hyperparameter Tuning**:
  Once the best-performing model is identified, optimize its hyperparameters using techniques like **GridSearchCV** or **RandomizedSearchCV** to further enhance its performance.
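Steps 2 to 6 can be sketched in a few lines. The example below is a minimal illustration rather than a prescribed workflow: it assumes scikit-learn's built-in breast cancer dataset, two arbitrarily chosen candidate models, and accuracy as the metric. The scaler is wrapped in a pipeline with each model so that it is re-fit on the training folds of every split and never sees the validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: a binary classification problem (dataset chosen for illustration)
X, y = load_breast_cancer(return_X_y=True)

# Step 4: hold out a test set that is not touched during model comparison
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Steps 2 and 3: preprocessing lives inside each candidate pipeline
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Steps 4 to 6: 5-fold cross-validation on the training set only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_means = {}
for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Stratified folds keep the class ratio roughly constant in every fold, which makes the per-fold scores easier to compare on datasets with uneven class sizes.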

### 3. **Key Metrics for Model Comparison**

- **Accuracy**: The proportion of correctly classified instances.
  
- **Precision and Recall**: These metrics matter most on imbalanced datasets, where accuracy alone can be misleading.
  - **Precision**: The proportion of true positives among all predicted positives.
  - **Recall**: The proportion of true positives out of all actual positives.

- **F1-Score**: Harmonic mean of Precision and Recall, especially useful when the dataset is imbalanced.

- **ROC-AUC**: The area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) across classification thresholds. It measures how well the model separates the positive and negative classes.

- **MSE and RMSE**: Mean Squared Error and Root Mean Squared Error for regression tasks. MSE is the average squared difference between actual and predicted values; RMSE is its square root, which expresses the error in the same units as the target.

- **R-squared**: The proportion of the variance in the target that the model explains. A value of 1 means the predictions match the actual data points exactly.
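To make the definitions above concrete, the following sketch computes each metric by hand with NumPy. The labels and predictions are small hypothetical arrays made up for illustration; the scikit-learn functions of the same name can be used to cross-check the hand-computed values.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error, precision_score, r2_score, recall_score

# Hypothetical imbalanced labels: 2 positives among 10 samples
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Confusion-matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)          # true positives among predicted positives
recall = tp / (tp + fn)             # true positives among actual positives
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Hypothetical regression targets and predictions
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mse = np.mean((y_actual - y_hat) ** 2)
rmse = np.sqrt(mse)                 # same units as the target
mae = np.mean(np.abs(y_actual - y_hat))
r2 = 1 - np.sum((y_actual - y_hat) ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)
print(f"mse={mse:.3f} rmse={rmse:.3f} mae={mae:.3f} r2={r2:.3f}")
```

In this toy case accuracy is 0.80 while precision and recall are both 0.50, which shows how accuracy can look acceptable on imbalanced data even when half of the positive predictions are wrong.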

### 4. **Overfitting and Underfitting:**
- **Overfitting**: A model is too complex (like a deep decision tree) and performs very well on the training data but poorly on new, unseen data. 
- **Underfitting**: A model is too simple (like linear regression on non-linear data) and fails to capture the patterns in the training data.

The goal of model selection is to find a balance where the model is neither underfitting nor overfitting.
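One way to see this trade-off is to compare training and test accuracy as model complexity grows. The sketch below is an illustration under assumptions: it uses a decision tree on scikit-learn's breast cancer dataset and treats `max_depth` as the complexity knob; the exact numbers depend on the split and the library version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# max_depth=None lets the tree grow until every leaf is pure
fit_report = {}
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    fit_report[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={fit_report[depth][0]:.3f}, test={fit_report[depth][1]:.3f}")
```

A large gap between training and test accuracy suggests overfitting, while low scores on both suggest underfitting. The fully grown tree typically fits the training set almost perfectly, so its training score alone says little about how it will generalize.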

### 5. **Example: Model Selection Using Cross-Validation**

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

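# Initialize models (fixed seed for reproducibility; a higher max_iter helps
# logistic regression converge on the unscaled features)
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'SVM': SVC(),
    'LogisticRegression': LogisticRegression(max_iter=1000)
}

# Perform 5-fold cross-validation on the training set and compare models
for model_name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{model_name} CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")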
```

### 6. **Model Comparison**

Once cross-validation is done, the model with the highest cross-validated accuracy is the natural candidate, but the final choice also depends on other metrics (e.g., precision, recall) and on business requirements.
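Several metrics can be collected in a single cross-validation run with `cross_validate`. The sketch below is one possible setup, not the only one: because iris has three classes, it assumes macro-averaged precision, recall, and F1 (the plain `precision`, `recall`, and `f1` scorers are defined for binary targets), and for brevity it cross-validates on the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Macro averaging gives each of the three classes equal weight
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

summary = {}
for name, model in models.items():
    cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)
    # cross_validate stores each metric under the key "test_<metric>"
    summary[name] = {metric: cv_results[f"test_{metric}"].mean() for metric in scoring}
    print(name, {metric: round(value, 3) for metric, value in summary[name].items()})
```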

### 7. **Model Tuning**

Once the best model is selected, hyperparameter tuning can be performed. For instance, for RandomForest:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

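# Fix random_state so the search is reproducible
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")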
```
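When the grid is large, `RandomizedSearchCV` samples a fixed number of parameter combinations instead of trying all of them. The following sketch mirrors the grid search above; the sampling ranges and `n_iter=10` are assumptions chosen for illustration, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Distributions are sampled; lists are drawn from uniformly
param_distributions = {
    "n_estimators": randint(50, 201),      # integers from 50 to 200
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": randint(2, 11),   # integers from 2 to 10
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled combinations
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.3f}")
```

The cost is controlled by `n_iter` rather than by the size of the grid: here 10 candidates are evaluated with 5 folds each, compared with 27 candidates for the exhaustive grid above.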

### 8. **Final Comparison and Model Selection**

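After tuning, evaluate the chosen model on the held-out test set. For example, calculate accuracy, precision, recall, and AUC scores. Use the test set only for this final check: if you go back and pick or re-tune models based on test scores, those scores stop being an unbiased estimate of performance on new data.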
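A minimal sketch of this final step, assuming the iris split used earlier and a deliberately small grid to keep it quick. `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so `best_estimator_` can be evaluated directly on the test set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Small illustrative grid; tuning uses the training set only
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
grid_search.fit(X_train, y_train)

# Evaluate the refitted best model once on the untouched test set
final_model = grid_search.best_estimator_
y_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```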

### Summary of Model Selection:

- **Cross-validation** helps ensure that the model generalizes well.
- **Multiple metrics** should be used for model comparison.
- **Hyperparameter tuning** refines the selected model's performance.
- **Business context** and interpretability of the model should also be considered when selecting the final model.

This process helps ensure that the chosen model suits your problem, balances performance with simplicity, and generalizes well to new, unseen data.

Import all the required libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
```

Load the data: Breast Cancer dataset (a classification problem)

```python
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()  # load the data
print(data)
```

Output (truncated):

```
{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
```

Data Split: Splitting the given data into input features and target variable

```python
X = data.data
y = data.target
print(X)
print(y)
```

Output (truncated):

```
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 
```

Splitting the data into a training set and a test set, then scaling the features

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
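A quick sanity check on the scaling step: after `StandardScaler`, every training feature should have mean close to 0 and standard deviation close to 1, while the test features will only be approximately standardized because they reuse the training statistics. The sketch below repeats the split so that it runs on its own; the names `X_train_scaled` and `X_test_scaled` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data
X_test_scaled = scaler.transform(X_test)        # same statistics reused on test data

# Largest deviation from the ideal mean (0) and standard deviation (1)
print("train: max |mean| =", np.abs(X_train_scaled.mean(axis=0)).max())
print("train: max |std - 1| =", np.abs(X_train_scaled.std(axis=0) - 1).max())
print("test:  max |mean| =", np.abs(X_test_scaled.mean(axis=0)).max())
```

Fitting the scaler on the training set only keeps information about the test set out of the preprocessing step.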

Data Modeling

```python
models = {
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(),
    'SVM': SVC(probability=True)
}

# Train models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name} trained successfully.')
```

```
Random Forest trained successfully.
KNN trained successfully.
Logistic Regression trained successfully.
SVM trained successfully.
```


```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Evaluate models
results = {}

for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None

    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred),
        'ROC AUC': roc_auc_score(y_test, y_prob) if y_prob is not None else 'N/A'
    }

# Display results
results_df = pd.DataFrame(results).T
print(results_df)
```

```
                     Accuracy  Precision    Recall  F1 Score   ROC AUC
Random Forest        0.956140   0.969697  0.955224  0.962406  0.995554
KNN                  0.956140   0.930556  1.000000  0.964029  0.982375
Logistic Regression  0.964912   0.970149  0.970149  0.970149  0.993331
SVM                  0.982456   0.971014  1.000000  0.985294  0.999047
```


```python
# Compare models
print('Model Comparison:')
print(results_df)
```

```
Model Comparison:
                     Accuracy  Precision    Recall  F1 Score   ROC AUC
Random Forest        0.956140   0.969697  0.955224  0.962406  0.995554
KNN                  0.956140   0.930556  1.000000  0.964029  0.982375
Logistic Regression  0.964912   0.970149  0.970149  0.970149  0.993331
SVM                  0.982456   0.971014  1.000000  0.985294  0.999047
```
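The notebook imports `confusion_matrix` and `classification_report` but does not use them. The sketch below shows one way they could complement the table above, using the SVM as an example. It repeats the same split and scaling so that it runs on its own; the variable names are illustrative. Note that in this dataset label 1 is `benign`, so the precision and recall reported by scikit-learn's defaults treat `benign` as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

# Rows are true classes, columns are predicted classes (order: malignant, benign)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Per-class precision, recall, and F1, labeled with the dataset's class names
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

The per-class view matters here because the two error types are not equally costly: a malignant tumor predicted as benign is usually a more serious mistake than the reverse, and a single accuracy figure hides which kind of error a model makes.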
