In [32]:
%pip install scikit-learn numpy matplotlib pandas mglearn

Note: you may need to restart the kernel to use updated packages.


The dataset is clean and has no missing values, so little preprocessing is needed. We load it and hold out 20% of the samples as a test set, fixing random_state so the split is reproducible.

In [33]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
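The claims about the dataset can be checked directly before moving on. The following is a minimal sketch, not part of the original notebook; the variable names (n_missing, class_counts) are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Count NaN entries in the feature matrix to confirm there are no missing values.
n_missing = int(np.isnan(data.data).sum())

# Count samples per class; index 0 is 'malignant', index 1 is 'benign'.
class_counts = np.bincount(data.target)

# Same split as in the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

print(f"Feature matrix shape: {data.data.shape}")
print(f"Missing values: {n_missing}")
print(f"Class counts: {dict(zip(data.target_names, class_counts.tolist()))}")
print(f"Train / test sizes: {X_train.shape[0]} / {X_test.shape[0]}")
```

The class counts show a moderate imbalance between the two classes, which is worth keeping in mind when reading accuracy figures later on.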

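We train two scikit-learn classifiers on the training set: k-nearest neighbors (KNN) and a random forest classifier (RFC). The hyperparameters of each are tuned with GridSearchCV using 5-fold cross-validation.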

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

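# Base estimators with default settings; GridSearchCV clones them for each candidate.
knn = KNeighborsClassifier()
rf = RandomForestClassifier()

# Hyperparameter grids: 9 x 2 = 18 candidates for KNN, 3 x 3 = 9 for the random forest.
parameters = {'n_neighbors': list(range(1, 10)), 'weights': ('uniform', 'distance')}
parameters_rf = {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}

# 5-fold cross-validation; classifiers are scored by accuracy unless `scoring` is set.
gs_knn = GridSearchCV(knn, parameters, cv=5)
gs_rf = GridSearchCV(rf, parameters_rf, cv=5)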
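To get a feel for how much work each search involves, the grids can be expanded with ParameterGrid. This is a small illustrative sketch; cv_folds, n_knn, and n_rf are names introduced here and do not appear in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cell above.
parameters = {'n_neighbors': list(range(1, 10)), 'weights': ('uniform', 'distance')}
parameters_rf = {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}

cv_folds = 5  # matches cv=5 in the GridSearchCV calls

# ParameterGrid enumerates every combination of the listed values.
n_knn = len(ParameterGrid(parameters))
n_rf = len(ParameterGrid(parameters_rf))

print(f"KNN: {n_knn} candidates, {n_knn * cv_folds} cross-validation fits")
print(f"RFC: {n_rf} candidates, {n_rf * cv_folds} cross-validation fits")
```

With the default refit=True, each search also performs one additional fit of the best candidate on the full training set.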

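Relevant evaluation metrics include accuracy, precision, recall, and F1-score. The cell below reports accuracy only: by default GridSearchCV scores each candidate by its mean accuracy over the 5 cross-validation folds, and best_score_ is that mean for the best candidate. Averaging over folds gives a more robust estimate than a single train/validation split.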

In [35]:
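# Run both grid searches on the training set only; the test set stays untouched.
gs_knn.fit(X_train, y_train)
gs_rf.fit(X_train, y_train)

print(f"Best parameters for KNN: {gs_knn.best_params_}")
print(f"Best parameters for RFC: {gs_rf.best_params_}")

# best_score_ is the mean cross-validated accuracy of the best candidate, not test accuracy.
print(f"KNN Accuracy: {gs_knn.best_score_}")
print(f"Random Forest Accuracy: {gs_rf.best_score_}")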

Best parameters for KNN: {'n_neighbors': 9, 'weights': 'distance'}
Best parameters for RFC: {'max_depth': 5, 'n_estimators': 100}
KNN Accuracy: 0.9296703296703297
Random Forest Accuracy: 0.9626373626373625
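Precision, recall, and F1-score can be computed on the held-out test set, which the cells above do not use. The sketch below is an assumption-laden illustration rather than the notebook's own evaluation: it refits each model with the best parameters printed above instead of rerunning the searches, and it fixes random_state=0 for the random forest (not set in the original) so the run is repeatable. The test_metrics dictionary is a name introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Best parameters as reported by the grid searches above.
# random_state=0 is an assumption added here for repeatability.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=9, weights="distance"),
    "RFC": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
}

test_metrics = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # average="binary" reports the metrics for the positive class (label 1, 'benign').
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary"
    )
    test_metrics[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, {k: round(float(v), 3) for k, v in test_metrics[name].items()})
```

Inside the notebook, the fitted searches can be used directly instead, for example gs_knn.predict(X_test), because GridSearchCV refits the best candidate on the full training set by default.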


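On this binary classification problem, the tuned random forest reaches a higher mean cross-validated accuracy (about 0.963) than the tuned KNN (about 0.930). Both figures are cross-validation scores on the training set, not test-set results. A fuller comparison of the two models would also weigh their strengths and weaknesses in handling high dimensionality, class imbalance, and generalization to unseen data.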
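One way to probe the high-dimensionality point for KNN: its predictions depend on distances across all 30 features, so features with large numeric ranges can dominate. The sketch below, which is a variation and not the notebook's method, puts a StandardScaler in front of KNN inside a Pipeline and reruns the same grid search for comparison. The names scaled_knn, scaled_grid, gs_plain, and gs_scaled are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Baseline: KNN on raw features, same grid as in the notebook.
plain_grid = {'n_neighbors': list(range(1, 10)), 'weights': ['uniform', 'distance']}
gs_plain = GridSearchCV(KNeighborsClassifier(), plain_grid, cv=5)
gs_plain.fit(X_train, y_train)

# Variation: standardize features inside the pipeline so that scaling is
# fitted on each training fold only and never sees the validation fold.
scaled_knn = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
# Pipeline parameters are addressed as <step name>__<parameter name>.
scaled_grid = {
    'knn__n_neighbors': list(range(1, 10)),
    'knn__weights': ['uniform', 'distance'],
}
gs_scaled = GridSearchCV(scaled_knn, scaled_grid, cv=5)
gs_scaled.fit(X_train, y_train)

print(f"KNN, raw features:    {gs_plain.best_score_:.4f}")
print(f"KNN, scaled features: {gs_scaled.best_score_:.4f}")
```

Keeping the scaler inside the pipeline, rather than scaling the whole training set up front, avoids leaking validation-fold statistics into the cross-validation scores.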