## Hands-on 3C
#### Build three classification models for breast cancer detection using scikit-learn's built-in dataset

In [1]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
feature_names = cancer.feature_names
target_names = cancer.target_names
X = cancer.data
y = cancer.target

To do: 
- Check the number of features in the dataset

In [4]:
print(f"Number of data samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")

Number of data samples: 569
Number of features: 30
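
Beyond the shape, it can help to look at the class balance and at how much the feature scales differ, since both affect how accuracy should be read later. The following is a minimal sketch rather than part of the original exercise; `class_counts` and `feature_ranges` are illustrative names introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Count samples per class; target_names maps label 0/1 to a class name.
class_counts = {
    str(name): int(count)
    for name, count in zip(cancer.target_names, np.bincount(y))
}
print(f"Samples per class: {class_counts}")

# Per-feature value range (max - min). Large differences between features
# matter for distance-based models such as k-Nearest Neighbors.
feature_ranges = X.max(axis=0) - X.min(axis=0)
print(f"Smallest feature range: {feature_ranges.min():.4f}")
print(f"Largest feature range:  {feature_ranges.max():.1f}")
```

If the ranges differ by orders of magnitude, standardizing the features before fitting scale-sensitive models is usually worth trying.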


To do: 
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on this dataset using 5-fold cross-validation.

In [7]:
models = {}
models['knn'] = KNeighborsClassifier()
models['lgr'] = LogisticRegression()
models['dtc'] = DecisionTreeClassifier()

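# NOTE: the task asks for 5-fold cross-validation (n_splits=5); the results
# recorded below were produced with n_splits=3 as written here.
kf = KFold(n_splits=3, shuffle=True, random_state=42)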
for m in models:
    scores = cross_val_score(models[m], X, y, cv=kf, n_jobs=-1)
    print(f"Model: {m}, mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std():.3%}")

Model: knn, mean accuracy: 92.966%, std accuracy: 1.959%
Model: lgr, mean accuracy: 93.315%, std accuracy: 2.886%
Model: dtc, mean accuracy: 92.796%, std accuracy: 0.884%
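
The run above uses unscaled features and three splits. As a hedged sketch of how the same comparison might look with the five folds the task asks for, the block below wraps the scale-sensitive models in a pipeline with `StandardScaler`. The names `scaled_models`, `skf` and `scaled_scores` are illustrative, and the choices of `StratifiedKFold`, `max_iter=1000` and `random_state=42` are assumptions made here, not settings from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# kNN and logistic regression are sensitive to feature scale, so they are
# standardized inside a pipeline; the scaler is re-fit on each training fold.
# Decision trees are insensitive to monotonic rescaling, so no scaler is used.
scaled_models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "lgr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "dtc": DecisionTreeClassifier(random_state=42),
}

# Five stratified folds keep the class ratio roughly equal in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scaled_scores = {}
for name, model in scaled_models.items():
    scaled_scores[name] = cross_val_score(model, X, y, cv=skf)
    print(
        f"Model: {name}, mean accuracy: {scaled_scores[name].mean():.3%}, "
        f"std accuracy: {scaled_scores[name].std():.3%}"
    )
```

Because the scaler lives inside the pipeline, no statistics from a validation fold are used when fitting it.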


To do: 
- Use Univariate Selection to select the 10 best features. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross-validation.

In [8]:
selector = SelectKBest(k=10)
Xs = selector.fit_transform(X, y)
print(Xs.shape)

for m in models:
    scores = cross_val_score(models[m], Xs, y, cv=kf, n_jobs=-1)
    print(f"Model: {m}, mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std():.3%}")

(569, 10)
Model: knn, mean accuracy: 92.614%, std accuracy: 2.170%
Model: lgr, mean accuracy: 94.195%, std accuracy: 2.414%
Model: dtc, mean accuracy: 92.443%, std accuracy: 0.239%
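
In the cell above the selector is fit on all 569 samples before cross-validation, so each validation fold has already influenced which features were kept. A minimal sketch of one way to avoid that is to put `SelectKBest` inside a pipeline, so selection is repeated on each training fold. The names `selected_names` and `leak_free` are illustrative, and the scaler, `max_iter=1000` and the five-fold split are assumptions introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# SelectKBest uses the ANOVA F-test (f_classif) by default; it is spelled
# out here only to make the scoring function explicit.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# get_support() returns a boolean mask over the 30 original features.
selected_names = cancer.feature_names[selector.get_support()]
print("Selected features:", list(selected_names))

# Selection inside the pipeline is re-fit on the training part of each fold,
# so the validation part never influences which features are chosen.
leak_free = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=10),
    LogisticRegression(max_iter=1000),
)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(leak_free, X, y, cv=kf)
print(f"mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std():.3%}")
```

The same pattern applies to the other two classifiers by swapping the final pipeline step.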


To do: 
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the original features to 10. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross-validation.

In [9]:
pca = PCA(n_components=10)
Xr = pca.fit_transform(X)
print(Xr.shape)

for m in models:
    scores = cross_val_score(models[m], Xr, y, cv=kf, n_jobs=-1)
    print(f"Model: {m}, mean accuracy: {scores.mean():.3%}, std accuracy: {scores.std():.3%}")

(569, 10)
Model: knn, mean accuracy: 92.966%, std accuracy: 1.959%
Model: lgr, mean accuracy: 94.899%, std accuracy: 1.754%
Model: dtc, mean accuracy: 91.915%, std accuracy: 0.507%
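
The kNN scores here are identical to those on the raw features. One plausible reason is that PCA on unscaled data is dominated by the few features with the largest variance, so the 10 components preserve almost all pairwise distances. The sketch below checks this by comparing `explained_variance_ratio_` with and without standardization; `raw_ratio` and `scaled_ratio` are illustrative names, and adding `StandardScaler` is an assumption, not part of the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# PCA on the raw features, as in the cell above.
raw_pca = PCA(n_components=10).fit(X)

# PCA after standardizing every feature to zero mean and unit variance.
scaled_pca = PCA(n_components=10).fit(StandardScaler().fit_transform(X))

# Fraction of total variance captured by each retained component.
raw_ratio = raw_pca.explained_variance_ratio_
scaled_ratio = scaled_pca.explained_variance_ratio_

print(f"Unscaled: first component explains {raw_ratio[0]:.1%} of the variance")
print(f"Scaled:   first component explains {scaled_ratio[0]:.1%} of the variance")
print(f"Scaled:   ten components explain   {scaled_ratio.sum():.1%} of the variance")
```

If a single unscaled component carries nearly all the variance, the remaining nine add little, which is a sign that standardizing before PCA is worth evaluating.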
