This project applies a genetic algorithm for feature selection to improve machine learning models that detect three diseases:
- Breast Cancer
- Parkinson's Disease
- PCOS (Polycystic Ovary Syndrome)
The datasets used in this project are sourced from Kaggle:
- Breast Cancer Wisconsin Data: Features extracted from breast mass aspirates.
- Parkinson's Disease Detection: Vocal measurements from healthy and Parkinson's patients.
- PCOS Dataset: Clinical measurements for PCOS diagnosis.
The following models are applied to the Breast Cancer dataset:
- Logistic Regression
- Random Forest
- AdaBoost
- Decision Tree
- K-Nearest Neighbors
- Support Vector Machine (Linear)
- Support Vector Machine (RBF)
For Breast Cancer, the genetic algorithm is applied to Logistic Regression, improving accuracy from 96.5% to 98.6%.
The following models are applied to the Parkinson's Disease dataset:
- Random Forest
- AdaBoost
- Gradient Boosting
- Decision Tree
- KNN
- Support Vector Machine (Linear)
- Support Vector Machine (RBF)
For Parkinson's Disease, the genetic algorithm is applied to Gradient Boosting, improving accuracy from 89.8% to 93.9%.
The following models are applied to the PCOS dataset:
- Random Forest
- AdaBoost
- Gradient Boosting
- Logistic Regression
- Decision Tree
- Support Vector Machine (Linear)
- Support Vector Machine (RBF)
- KNN
For PCOS, the genetic algorithm is applied to KNN, improving accuracy from 84.6% to 88.2%.
The genetic algorithm selects features in the following steps:
1. Initialization: Create a random population of binary chromosomes, where each bit marks a feature as selected (1) or not selected (0).
2. Fitness Calculation: Score each chromosome by the accuracy of the model trained on its selected features.
3. Selection: Choose the best-scoring chromosomes as parents.
4. Crossover: Create a new population by combining pairs of parent chromosomes.
5. Mutation: Randomly flip bits in the new chromosomes.
6. Repeat: Repeat steps 2-5 for multiple generations.
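The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the function name `run_ga`, its parameters, single-point crossover, and the stand-in `toy_fitness` function are all assumptions. In the project, the fitness would instead be the accuracy of a model trained on the features a chromosome selects.

```python
import random

def run_ga(n_features, fitness, pop_size=20, generations=30,
           mutation_rate=0.05, n_parents=10, seed=0):
    """Return (best_chromosome, best_score) for a binary feature mask.

    Hypothetical helper: all parameter names and defaults are illustrative.
    Requires n_features >= 2 and n_parents >= 2.
    """
    rng = random.Random(seed)
    # 1. Initialization: random binary chromosomes (1 = feature selected).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # 2. Fitness calculation: score every chromosome, best first.
        scored = sorted(((fitness(c), c) for c in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0][0], scored[0][1][:]
        # 3. Selection: keep the best-scoring chromosomes as parents.
        parents = [c for _, c in scored[:n_parents]]
        # 4. Crossover: single-point crossover between two random parents.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_features - 1)
            children.append(a[:cut] + b[cut:])
        # 5. Mutation: flip each bit with a small probability.
        population = [[1 - g if rng.random() < mutation_rate else g
                       for g in child]
                      for child in children]
        # 6. Repeat: the loop runs steps 2-5 again on the new population.
    return best, best_score

# Stand-in fitness (assumption): reward three "informative" features and
# lightly penalize every other selected feature.
INFORMATIVE = {0, 3, 5}

def toy_fitness(chromosome):
    hits = sum(chromosome[i] for i in INFORMATIVE)
    extras = sum(chromosome) - hits
    return hits - 0.1 * extras

best, score = run_ga(n_features=10, fitness=toy_fitness)
print("selected features:", [i for i, bit in enumerate(best) if bit])
print("fitness:", score)
```

To adapt the sketch, replace `toy_fitness` with a function that trains the chosen model on the columns where the chromosome is 1 and returns its validation accuracy.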
Genetic algorithm optimization improves accuracy on all three datasets:
- Breast Cancer: Logistic Regression accuracy improves from 96.5% to 98.6%
- Parkinson's Disease: Gradient Boosting accuracy improves from 89.8% to 93.9%
- PCOS: KNN accuracy improves from 84.6% to 88.2%
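To put these figures on a common scale, the short sketch below computes the gain in percentage points and the relative reduction in error rate for each dataset. The accuracy values come from the list above. The relative error reduction is a derived figure added here for comparison rather than a metric reported by the project, and the `summarize` helper is hypothetical.

```python
# (accuracy before, accuracy after) in percent, taken from the results above.
results = {
    "Breast Cancer (Logistic Regression)": (96.5, 98.6),
    "Parkinson's Disease (Gradient Boosting)": (89.8, 93.9),
    "PCOS (KNN)": (84.6, 88.2),
}

def summarize(before, after):
    """Return (gain in percentage points, relative error reduction in %)."""
    gain = after - before
    error_cut = gain / (100.0 - before) * 100.0
    return round(gain, 1), round(error_cut, 1)

for name, (before, after) in results.items():
    gain, error_cut = summarize(before, after)
    print(f"{name}: +{gain} points, {error_cut}% fewer errors")
```

The gains are 2.1, 4.1, and 3.6 percentage points, which correspond to roughly 60.0%, 40.2%, and 23.4% fewer misclassifications respectively.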
The optimal feature subsets selected by the genetic algorithm are detailed in the notebook.
The IPython notebook contains the full code implementation. To use:
- Install requirements: numpy, pandas, scikit-learn, matplotlib
- Run Jupyter notebook
- Run cells in order
The outputs include accuracy scores, confusion matrices, and feature importance graphs.
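For readers unfamiliar with confusion matrices, the sketch below shows how one relates to an accuracy score for a binary diagnosis. It is a plain-Python illustration with made-up labels, not the notebook's code. The helper names are hypothetical, and the row and column layout follows the convention used by scikit-learn's `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
from collections import Counter

def confusion_matrix_2x2(y_true, y_pred, positive=1):
    """Return [[TN, FP], [FN, TP]] for binary labels."""
    counts = Counter((t == positive, p == positive)
                     for t, p in zip(y_true, y_pred))
    tn, fp = counts[(False, False)], counts[(False, True)]
    fn, tp = counts[(True, False)], counts[(True, True)]
    return [[tn, fp], [fn, tp]]

def accuracy(matrix):
    """Fraction of correct predictions: the diagonal over the total."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return (tn + tp) / total if total else 0.0

# Made-up example labels (1 = disease present, 0 = healthy).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix_2x2(y_true, y_pred)
print(matrix)            # [[3, 1], [1, 3]]
print(accuracy(matrix))  # 0.75
```

The off-diagonal cells separate false positives from false negatives, a distinction that a single accuracy score hides and that matters in a diagnostic setting.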
Using a genetic algorithm for feature selection improved model accuracy on all three disease-detection datasets, with gains of 2.1 to 4.1 percentage points. This suggests the approach can be effective for classification tasks in healthcare applications.