
In [1]:
# Import necessary libraries for data manipulation, visualization, and machine learning
import numpy as np # For numerical operations
import pandas as pd # For data manipulation and analysis
import matplotlib.pyplot as plt # For creating static, interactive, and animated visualizations
import seaborn as sns # For creating informative statistical graphics

# Import machine learning models and tools from scikit-learn
from sklearn.ensemble import RandomForestClassifier # Random Forest classification model
from sklearn.model_selection import train_test_split # Function for splitting data into training and testing sets
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score # Functions for evaluating model performance
from sklearn.neighbors import KNeighborsClassifier # K-Nearest Neighbors classification model
from sklearn.metrics import confusion_matrix # Function to compute confusion matrix
from sklearn.metrics import classification_report # Function to build a text report showing the main classification metrics
from sklearn.preprocessing import StandardScaler # For standardizing features by removing the mean and scaling to unit variance
from sklearn.linear_model import LogisticRegression # Logistic Regression classification model
from sklearn.svm import SVC # Support Vector Classifier

In [2]:
# Load the breast cancer dataset from scikit-learn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

# Print the loaded data (optional, for inspection)
print(data)

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
 

Data Split: Splitting the given data into input features and the target variable.

In [3]:
# Separate features (x) and target variable (y)
x=data.data # Features
y=data.target # Target variable (diagnosis)
print(x)
print(y)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 
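
Before interpreting any metric, it helps to confirm which integer label stands for which diagnosis. The sketch below reloads the dataset and prints the class names in label order with their sample counts; the variable name `counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the human-readable name of integer label i
print(data.target_names)

# Number of samples per class, indexed by integer label
counts = np.bincount(data.target)
for label, (name, count) in enumerate(zip(data.target_names, counts)):
    print(f"label {label}: {name} ({count} samples)")
```

In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign, so the default "positive" class used by `precision_score` and `recall_score` is benign.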

Data Normalization: StandardScaler rescales each feature to zero mean and unit variance.

Data Splitting into train set and test set.

In [4]:
# Split the data into training and testing sets
# test_size=0.2 means 20% of the data will be used for testing, and 80% for training
# random_state=0 ensures reproducibility of the split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

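# Data normalization using StandardScaler
# StandardScaler standardizes features by removing the mean and scaling to unit variance
# Note: x_train and x_test were created above from the unscaled x, so this step
# standardizes only x; the split arrays used for training below remain unscaled
sc=StandardScaler()
x=sc.fit_transform(x) # Fit on the full dataset and transform it
print(x)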

[[ 1.09706398 -2.07333501  1.26993369 ...  2.29607613  2.75062224
   1.93701461]
 [ 1.82982061 -0.35363241  1.68595471 ...  1.0870843  -0.24388967
   0.28118999]
 [ 1.57988811  0.45618695  1.56650313 ...  1.95500035  1.152255
   0.20139121]
 ...
 [ 0.70228425  2.0455738   0.67267578 ...  0.41406869 -1.10454895
  -0.31840916]
 [ 1.83834103  2.33645719  1.98252415 ...  2.28998549  1.91908301
   2.21963528]
 [-1.80840125  1.22179204 -1.81438851 ... -1.74506282 -0.04813821
  -0.75120669]]
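
The cell above standardizes `x` after the split, so the scaled values never reach `x_train` or `x_test`. A common alternative, sketched here under the same split settings, is to fit the scaler on the training rows only and reuse its statistics for the test rows. The names `x_train_scaled` and `x_test_scaled` are hypothetical and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)  # learn mean/std from training rows only
x_test_scaled = sc.transform(x_test)        # apply the training statistics to test rows

# Training columns are centered at 0 with unit variance; test columns are only close to that
print(np.round(x_train_scaled.mean(axis=0)[:3], 6))
print(np.round(x_train_scaled.std(axis=0)[:3], 6))
print(x_train_scaled.shape, x_test_scaled.shape)
```

Fitting on the training rows alone keeps test-set statistics out of preprocessing, which is why the test columns end up only approximately standardized.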


Data Modelling

In [5]:
# Define a dictionary of models to train
models={
    'Random Forest':RandomForestClassifier(),
    'KNN':KNeighborsClassifier(),
    'Logistic Regression':LogisticRegression(),
    'SVM':SVC(probability=True) # SVM with probability estimates enabled
}

# Train each model
for name,model in models.items():
  model.fit(x_train,y_train)
  print(f'{name} trained model')

Random Forest trained model
KNN trained model
Logistic Regression trained model
SVM trained model


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [6]:
results={}
for name,model in models.items():
  y_pred=model.predict(x_test)
  y_prob=model.predict_proba(x_test)[:,1] if hasattr(model,'predict_proba') else None
  results[name]={
      'accuracy':accuracy_score(y_test,y_pred),
      'Precision':precision_score(y_test,y_pred),
      'Recall':recall_score(y_test,y_pred),
      'F1 Score':f1_score(y_test,y_pred),
      'Confusion Matrix':confusion_matrix(y_test,y_pred),
      'Classification Report':classification_report(y_test,y_pred),
      'ROC AUC':  roc_auc_score(y_test,y_prob) if y_prob is not None else np.nan
  }


#Display Results
result_df=pd.DataFrame(results).T
print(result_df)

                     accuracy Precision    Recall  F1 Score  \
Random Forest         0.95614  0.969697  0.955224  0.962406   
KNN                  0.938596  0.954545  0.940299  0.947368   
Logistic Regression  0.947368  0.969231  0.940299  0.954545   
SVM                  0.929825   0.90411  0.985075  0.942857   

                       Confusion Matrix  \
Random Forest        [[45, 2], [3, 64]]   
KNN                  [[44, 3], [4, 63]]   
Logistic Regression  [[45, 2], [4, 63]]   
SVM                  [[40, 7], [1, 66]]   

                                                 Classification Report  \
Random Forest                      precision    recall  f1-score   ...   
KNN                                precision    recall  f1-score   ...   
Logistic Regression                precision    recall  f1-score   ...   
SVM                                precision    recall  f1-score   ...   

                      ROC AUC  
Random Forest        0.997142  
KNN                  0.957923  
Logistic Regression  0.993649  
SVM                   0.98444  
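
The confusion matrices printed above can be unpacked to see where each scalar metric comes from. The sketch below uses the SVM matrix from that output; scikit-learn puts true labels on rows and predicted labels on columns, so for labels 0 and 1 the flattened order is tn, fp, fn, tp with class 1 as the positive class. The name `recall_class0` is introduced here for illustration.

```python
import numpy as np

# SVM confusion matrix copied from the output above
# rows: true label (0, 1); columns: predicted label (0, 1)
cm = np.array([[40, 7],
               [1, 66]])

# Flattened order for binary labels: tn, fp, fn, tp (positive class = label 1)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)      # of all predicted 1s, the share that are truly 1
recall = tp / (tp + fn)         # of all true 1s, the share that were found
recall_class0 = tn / (tn + fp)  # of all true 0s, the share that were found

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f}")
print(f"class-0 recall={recall_class0:.6f}")
```

The first three values reproduce the SVM row of the table (0.929825, 0.904110, 0.985075). The class-0 recall, about 0.851, is the share of malignant cases caught, since label 0 is malignant in scikit-learn's encoding of this dataset.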

In [7]:
# Dictionary to store evaluation results for each model
results={}

# Evaluate each trained model
for name,model in models.items():
  # Make predictions on the test set
  y_pred=model.predict(x_test)

  # Get probability estimates for ROC AUC (if supported by the model)
  y_prob=model.predict_proba(x_test)[:,1] if hasattr(model,'predict_proba') else None

  # Calculate evaluation metrics and store them in the results dictionary
  results[name]={
      'accuracy':accuracy_score(y_test,y_pred),
      'Precision':precision_score(y_test,y_pred),
      'Recall':recall_score(y_test,y_pred),
      'F1 Score':f1_score(y_test,y_pred),
      # 'Confusion Matrix':confusion_matrix(y_test,y_pred), # Confusion matrix (optional)
      # 'Classification Report':classification_report(y_test,y_pred), # Classification report (optional)
      'ROC AUC':  roc_auc_score(y_test,y_prob) if y_prob is not None else np.nan # ROC AUC score
  }


# Convert the results dictionary to a pandas DataFrame for easy display
result_df=pd.DataFrame(results).T

# Display the model comparison table
print('Model Comparison: ')
print(result_df)

Model Comparison: 
                     accuracy  Precision    Recall  F1 Score   ROC AUC
Random Forest        0.956140   0.969697  0.955224  0.962406  0.997142
KNN                  0.938596   0.954545  0.940299  0.947368  0.957923
Logistic Regression  0.947368   0.969231  0.940299  0.954545  0.993649
SVM                  0.929825   0.904110  0.985075  0.942857  0.984440
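
The table above comes from a single 114-sample test split, where one prediction shifts accuracy by almost a percentage point. One way to check whether the ranking holds up is stratified k-fold cross-validation, sketched below. This departs from the notebook in several assumed ways: scaling is placed inside pipelines for the scale-sensitive models, `max_iter=1000` and `random_state=0` are illustrative settings, and the names `candidates` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a scaler that is refit inside every training fold
candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Five stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, x, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the fold-to-fold standard deviation is comparable to the gap between two models' means, the data does not clearly separate them.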


## Summary and Conclusion

This notebook demonstrates a typical machine learning workflow for classifying breast cancer as malignant or benign using the scikit-learn breast cancer dataset.

The steps involved were:

1.  **Loading the data**: The breast cancer dataset was loaded from `sklearn.datasets`.
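2.  **Feature/target separation**: The data was separated into features (x) and the target variable (y).
3.  **Train/test split**: The data was split into training (80%) and testing (20%) sets.
4.  **Data Normalization**: `StandardScaler` was applied to the full feature matrix `x` after the split, so the standardized values were not passed on to `x_train` and `x_test`; the models were trained and evaluated on unscaled features.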
5.  **Model Training**: Four different classification models were trained on the training data: Random Forest, K-Nearest Neighbors (KNN), Logistic Regression, and Support Vector Machine (SVM).
6.  **Model Evaluation**: The trained models were evaluated on the testing data using several metrics: accuracy, precision, recall, F1 score, and ROC AUC.

**Conclusion:**

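Based on the evaluation metrics, the **Random Forest** model performed best among the ones tested, with the highest accuracy (0.956), precision (0.970), F1 score (0.962), and ROC AUC (0.997), although its precision lead over Logistic Regression (0.969) is negligible. Note that scikit-learn encodes benign as class 1 and malignant as class 0 in this dataset, so the precision, recall, and F1 values above describe the benign class. The SVM model has the highest recall (0.985), meaning it identified almost all benign cases, but its confusion matrix shows 7 of the 47 malignant test cases predicted as benign, the most of any model, which also gives it the lowest accuracy, precision, and F1 score. The Logistic Regression model also performed well, but a convergence warning was observed during training; because the models were fit on unscaled features, either scaling the training data or increasing `max_iter` should address it. The KNN model had the lowest ROC AUC (0.958) and tied with Logistic Regression for the lowest recall (0.940).

With only 114 test samples, the accuracy gaps between models correspond to one to three predictions, so this ranking should be read as tentative.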

Further steps could include:

*   Hyperparameter tuning for the Random Forest and Logistic Regression models to potentially improve their performance.
*   Investigating the false positives and false negatives from the confusion matrices to understand where the models are making errors.
*   Visualizing the results, such as plotting ROC curves, for a more intuitive comparison of model performance.
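
As a starting point for the hyperparameter tuning step listed above, here is a minimal grid-search sketch for Logistic Regression. The grid values for `C`, the `max_iter=1000` setting, the `roc_auc` scoring choice, and the names `pipe`, `param_grid`, and `search` are all assumptions made for illustration; a Random Forest search would follow the same pattern with its own parameter grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(x_train, y_train)  # the test split is not touched during the search

print("best C:", search.best_params_["logisticregression__C"])
print("cross-validated ROC AUC:", round(search.best_score_, 4))
print("held-out ROC AUC:", round(search.score(x_test, y_test), 4))
```

Keeping the test split out of the search means the final held-out score is not biased by the parameter selection.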