This project uses a Support Vector Machine (SVM) classifier to predict breast cancer diagnoses from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. It covers data loading and preparation, k-fold cross-validation without hyperparameter tuning, hyperparameter tuning using GridSearchCV with StratifiedKFold, and evaluation on a held-out test dataset.
The process includes the following steps:
- Data Loading and Model Preparation: The breast cancer dataset is loaded, and the data is prepared for modeling.
- k-fold Cross-Validation: The untuned model's performance is estimated with k-fold cross-validation, which averages the score over several train/validation splits of the data and gives a baseline to compare against.
- Hyperparameter Tuning: Using GridSearchCV with StratifiedKFold, the model's hyperparameters are fine-tuned to improve performance.
- Evaluation on Test Dataset: The tuned model is then evaluated on a separate test dataset to gauge its predictive capability.
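The first two steps above can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: it assumes scikit-learn's bundled copy of the WDBC data (`load_breast_cancer`), a 70/30 split, five folds, a default `SVC`, and a fixed `random_state`, none of which are stated in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled WDBC data stands in for whatever
# file the notebooks load.
X, y = load_breast_cancer(return_X_y=True)

# Assumption: 30% of the rows are held out for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: an SVC with default hyperparameters, scored by plain k-fold
# cross-validation on the training portion only.
baseline = SVC()
folds = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=folds)

print(f"Mean cross-validation accuracy: {cv_scores.mean():.4f}")
```

Keeping the test rows out of the cross-validation loop means the final test score is measured on data the model selection never saw.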
The project achieved the following results:
- Mean cross-validation score (before hyperparameter tuning): 0.8969
- Best cross-validation score (after hyperparameter tuning): 0.9497
- Final evaluation on the test dataset:
  - Accuracy: 0.9649
  - Sensitivity: 0.9630
  - Specificity: 0.9683
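Hyperparameter tuning raised the cross-validation score from 0.8969 to 0.9497. On the test dataset, the tuned model classified about 96% of cases correctly, with sensitivity and specificity both above 96%.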
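For reference, accuracy, sensitivity, and specificity can all be derived from a binary confusion matrix. The sketch below uses a hypothetical helper, `binary_metrics`, and made-up labels. It treats label 1 as the positive class, which is an assumption, because this README does not say which class the project counts as positive.

```python
from sklearn.metrics import confusion_matrix


def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for 0/1 labels.

    Assumption: label 1 is the positive class.
    """
    # With labels=[0, 1], ravel() yields the counts in the order
    # tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up labels for illustration only: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Note that in scikit-learn's bundled copy of the dataset, label 1 denotes benign and label 0 malignant, so it is worth checking the encoding before reading sensitivity as "malignant cases detected".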
To set up this project, you'll need Python and the following libraries:
- Pandas
- NumPy
- Scikit-learn
You can install the dependencies with:
pip install pandas numpy scikit-learn
To use this project, follow these steps:
- Load the breast cancer dataset.
- Prepare the data and split it into training and test sets.
- Perform k-fold cross-validation to estimate the model's performance.
- Tune the hyperparameters using GridSearchCV with StratifiedKFold.
- Evaluate the model on the test dataset.
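The tuning and evaluation steps might look like the sketch below. It is an illustration under stated assumptions rather than the notebooks' code. The parameter grid, the `StandardScaler` preprocessing step, the 70/30 split, and the fold count are all choices made here for the example. The notebooks remain the reference for the settings actually used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled WDBC data and a 70/30 split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumption: features are standardized before the SVM. Putting the
# scaler inside the pipeline refits it on each training fold.
model = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", 0.01],
}

# StratifiedKFold keeps the class ratio roughly equal in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(model, param_grid, cv=folds, scoring="accuracy")
search.fit(X_train, y_train)

# By default GridSearchCV refits the best configuration on all of
# X_train, so score() evaluates that refitted model.
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The exact scores depend on the split, the grid, and the preprocessing, so they will not necessarily match the figures reported above.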
Refer to the Jupyter notebooks for detailed code and explanations.
This project is licensed under the MIT License - see the LICENSE.md file for details.
The dataset used in this project is cited as follows:
Wolberg, William, Mangasarian, Olvi, Street, Nick, and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.