## Hands-on 3C 
#### Build three classification models for breast cancer detection using scikit-learn's built-in dataset

In [1]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
feature_names = cancer.feature_names
target_names = cancer.target_names
X = cancer.data
y = cancer.target

To do: 
- Check the number of features in the dataset

In [3]:
print(f"Number of features: {X.shape[1]}")

Number of features: 30
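
Beyond the feature count, it can help to look at the sample count and class balance before comparing models, since accuracy is easier to interpret when the class proportions are known. The following is a minimal sketch; the names `n_samples`, `n_features` and `class_counts` are illustrative and do not appear in the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# X has one row per sample and one column per feature
n_samples, n_features = X.shape

# Count how many samples fall into each target class (illustrative helper dict)
class_counts = {
    str(name): int(count)
    for name, count in zip(cancer.target_names, np.bincount(y))
}

print(f"Samples: {n_samples}, features: {n_features}")
print(f"Class counts: {class_counts}")
print(f"First five feature names: {list(cancer.feature_names[:5])}")
```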


To do: 
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression, and Decision Tree on this dataset using 5-fold cross-validation.

In [6]:
# Evaluate the performance of various ML algorithms using spot-checking
models = {}
models["knn"] = KNeighborsClassifier()
models["lgr"] = LogisticRegression()
models["dtc"] = DecisionTreeClassifier(random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for n in models:
    scores = cross_val_score(models[n], X, y, n_jobs=-1, cv=kf)
    print(f"{n}, mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std()}")

knn, mean accuracy: 93.669%, std accuracy: 0.02809807834514385
lgr, mean accuracy: 94.374%, std accuracy: 0.028675104081486897
dtc, mean accuracy: 93.322%, std accuracy: 0.01623576567266633
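
The spot-check above runs on unscaled features. k-Nearest Neighbors and Logistic Regression are both sensitive to feature scale, so one possible variation is to standardize inside a pipeline, so that the scaler is fitted on the training folds only. This is a sketch and not part of the original exercise: `StandardScaler`, `make_pipeline`, `max_iter=1000` and the name `scaled_models` are assumptions added here, and the scores it prints may differ from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Each pipeline refits the scaler on the training part of every fold
scaled_models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "lgr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scaled_scores = {}
for name, model in scaled_models.items():
    scaled_scores[name] = cross_val_score(model, X, y, cv=kf)
    print(
        f"{name} (scaled), mean accuracy: {scaled_scores[name].mean():.3%}, "
        f"std accuracy: {scaled_scores[name].std():.4f}"
    )
```

The decision tree is left out because tree splits do not depend on feature scale.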


To do: 
- Use univariate selection to select the 10 best features. Evaluate the performance of k-Nearest Neighbors, Logistic Regression, and Decision Tree on these features using 5-fold cross-validation.

In [8]:
fs = SelectKBest(k=10)
Xs = fs.fit_transform(X, y)

print("After feature selection")
for n in models:
    scores = cross_val_score(models[n], Xs, y, n_jobs=-1, cv=kf)
    print(f"{n}, mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std()}")

After feature selection
knn, mean accuracy: 92.967%, std accuracy: 0.027839028415963876
lgr, mean accuracy: 93.844%, std accuracy: 0.033868565007576394
dtc, mean accuracy: 92.265%, std accuracy: 0.013021902231803785
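
In the cell above, `SelectKBest` is fitted on the full dataset before cross-validation, so the held-out folds influence which features are kept. A hedged alternative is to place the selector inside a pipeline so that selection is repeated on each training fold. The sketch below also lists the features chosen on the full data; `make_pipeline` and the names `selected` and `pipe_scores` are additions for illustration, and the printed scores may differ from those above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Inspect which 10 features the default score function (f_classif) keeps
fs = SelectKBest(k=10).fit(X, y)
selected = cancer.feature_names[fs.get_support()]
print("Selected features:", list(selected))

# Selection now happens inside each training fold rather than once up front
pipe = make_pipeline(SelectKBest(k=10), DecisionTreeClassifier(random_state=42))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X, y, cv=kf)
print(f"dtc (pipeline), mean accuracy: {pipe_scores.mean():.3%}, std accuracy: {pipe_scores.std():.4f}")
```

The same pattern applies to the other two models by swapping the final pipeline step.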


To do: 
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the original features to 10. Evaluate the performance of k-Nearest Neighbors, Logistic Regression, and Decision Tree on these features using 5-fold cross-validation.

In [9]:
pca = PCA(n_components=10)
Xr = pca.fit_transform(X)

print("After dimensionality reduction")
for n in models:
    scores = cross_val_score(models[n], Xr, y, n_jobs=-1, cv=kf)
    print(f"{n}, mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std()}")

After dimensionality reduction
knn, mean accuracy: 93.669%, std accuracy: 0.02809807834514385
lgr, mean accuracy: 95.426%, std accuracy: 0.015294618659067174
dtc, mean accuracy: 91.390%, std accuracy: 0.033900448367849625
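
The kNN accuracy after PCA matches the accuracy on the original 30 features. One plausible explanation is that PCA on unscaled data keeps almost all of the variance in the first few components, so pairwise distances barely change. The sketch below checks this by comparing cumulative explained variance with and without standardization; `StandardScaler`, `make_pipeline` and the names `raw_cumulative` and `scaled_cumulative` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# PCA on the raw features, as in the cell above
raw_pca = PCA(n_components=10).fit(X)
raw_cumulative = np.cumsum(raw_pca.explained_variance_ratio_)

# PCA after standardizing each feature to zero mean and unit variance
scaled_pipe = make_pipeline(StandardScaler(), PCA(n_components=10)).fit(X)
scaled_cumulative = np.cumsum(scaled_pipe.named_steps["pca"].explained_variance_ratio_)

print("Cumulative explained variance (raw):   ", np.round(raw_cumulative, 4))
print("Cumulative explained variance (scaled):", np.round(scaled_cumulative, 4))
```

If the raw curve saturates after one or two components, large-magnitude features are dominating the projection, and standardizing before PCA may give components that reflect more of the feature set.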
