## Hands-on 3C 
#### Build three classification models for breast cancer detection using scikit-learn's built-in dataset

In [2]:
# Initialization
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore')

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
feature_names = cancer.feature_names
target_names = cancer.target_names
X = cancer.data
y = cancer.target

To do: 
- Check the number of features in the dataset

In [4]:
print(f'Number of features: {X.shape[1]}')

Number of features: 30


To do: 
- Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on this dataset using 5-fold cross validation.

In [5]:
models = []
models.append(('lgr', LogisticRegression()))
models.append(('knn', KNeighborsClassifier()))
models.append(('dtc', DecisionTreeClassifier(random_state=42)))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for n, m in models:
    scores = cross_val_score(m, X, y, cv=kf, n_jobs=-1)
    print(f'{n} mean score: {scores.mean():.2%}, std: {scores.std():.2%}')

lgr mean score: 94.20%, std: 2.98%
knn mean score: 93.67%, std: 2.81%
dtc mean score: 93.32%, std: 1.62%
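k-Nearest Neighbors and Logistic Regression are both sensitive to feature scale, and the 30 features in this dataset span very different ranges. The following is a minimal sketch of one possible variation, not part of the original exercise: it assumes standardization is desirable and wraps each model in a pipeline so the scaler is fit only on the training folds. The `StandardScaler` step, the `max_iter=1000` setting and the names `scaled_models` and `scaled_scores` are assumptions introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Assumption: each model gets its own scaler, refit inside every training fold
scaled_models = [
    ('lgr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ('dtc', make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))),
]

scaled_scores = {}
for n, m in scaled_models:
    scores = cross_val_score(m, X, y, cv=kf)
    scaled_scores[n] = scores
    print(f'{n} (scaled) mean score: {scores.mean():.2%}, std: {scores.std():.2%}')
```

Decision trees split on thresholds and are largely unaffected by scaling, so the tree is included here only to keep the comparison uniform.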


To do: 
- Use Univariate Selection to select the 10 best features. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross validation.

In [6]:
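# Univariate selection: keep the 10 highest-scoring features
# (SelectKBest uses the ANOVA F-test, f_classif, by default)
fs = SelectKBest(k=10)
X2 = fs.fit_transform(X, y)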

for n, m in models:
    scores = cross_val_score(m, X2, y, cv=kf, n_jobs=-1)
    print(f'After feature selection, {n} mean score: {scores.mean():.2%}, std: {scores.std():.2%}')

[('lgr', LogisticRegression()), ('knn', KNeighborsClassifier()), ('dtc', DecisionTreeClassifier(random_state=42))]
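Fitting `SelectKBest` on the full dataset before cross validation lets the selector see the labels of every test fold, which can make the scores slightly optimistic. A hedged sketch of one way to avoid this, assuming the same models and fold settings as above, is to place the selector inside a `Pipeline` so it is refit on each training fold. The step names `'select'` and `'model'` and the `fs_scores` dictionary are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

models = [
    ('lgr', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('dtc', DecisionTreeClassifier(random_state=42)),
]
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fs_scores = {}
for n, m in models:
    # The selector is refit on the training portion of every fold
    pipe = Pipeline([
        ('select', SelectKBest(score_func=f_classif, k=10)),
        ('model', m),
    ])
    scores = cross_val_score(pipe, X, y, cv=kf)
    fs_scores[n] = scores
    print(f'{n} (10 best features) mean score: {scores.mean():.2%}, std: {scores.std():.2%}')

# For inspection only: which features a selector fit on all data would keep
selected = cancer.feature_names[SelectKBest(k=10).fit(X, y).get_support()]
print(selected)
```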


To do: 
- Use Principal Component Analysis (PCA) to reduce the dimensionality of the original features to 10. Evaluate the performance of k-Nearest Neighbors, Logistic Regression and Decision Tree on these features using 5-fold cross validation.
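One possible approach to this task is sketched below; it is not the notebook's own solution. It assumes the features are standardized before PCA (PCA is variance-based, so unscaled features with large ranges would otherwise dominate the components) and that PCA is fit inside each training fold via a pipeline. The names `pca_scores` and `X_pca` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = [
    ('lgr', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('dtc', DecisionTreeClassifier(random_state=42)),
]
kf = KFold(n_splits=5, shuffle=True, random_state=42)

pca_scores = {}
for n, m in models:
    # Assumption: scale first, then project onto 10 principal components
    pipe = make_pipeline(StandardScaler(), PCA(n_components=10), m)
    scores = cross_val_score(pipe, X, y, cv=kf)
    pca_scores[n] = scores
    print(f'{n} (PCA, 10 components) mean score: {scores.mean():.2%}, std: {scores.std():.2%}')

# For inspection only: shape of the reduced data when PCA is fit on all samples
X_pca = make_pipeline(StandardScaler(), PCA(n_components=10)).fit_transform(X)
print(X_pca.shape)
```

Dropping the `StandardScaler` step reproduces a plain PCA on the raw features, which may be closer to what the exercise intends.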