<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


In [2]:
from sklearn import datasets

In [3]:
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [4]:
import numpy as np
import pandas as pd

# Get the data

In [5]:
cancer = datasets.load_breast_cancer()

In [6]:
print(cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [7]:
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)

In [8]:
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [9]:
y = cancer.target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [10]:
X.isna().sum().sum()

0

In [11]:
X.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mean radius,569.0,14.127292,3.524049,6.981,11.7,13.37,15.78,28.11
mean texture,569.0,19.289649,4.301036,9.71,16.17,18.84,21.8,39.28
mean perimeter,569.0,91.969033,24.298981,43.79,75.17,86.24,104.1,188.5
mean area,569.0,654.889104,351.914129,143.5,420.3,551.1,782.7,2501.0
mean smoothness,569.0,0.09636,0.014064,0.05263,0.08637,0.09587,0.1053,0.1634
mean compactness,569.0,0.104341,0.052813,0.01938,0.06492,0.09263,0.1304,0.3454
mean concavity,569.0,0.088799,0.07972,0.0,0.02956,0.06154,0.1307,0.4268
mean concave points,569.0,0.048919,0.038803,0.0,0.02031,0.0335,0.074,0.2012
mean symmetry,569.0,0.181162,0.027414,0.106,0.1619,0.1792,0.1957,0.304
mean fractal dimension,569.0,0.062798,0.00706,0.04996,0.0577,0.06154,0.06612,0.09744


# Preprocessing

In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [13]:
list(X.columns)

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean symmetry',
 'mean fractal dimension',
 'radius error',
 'texture error',
 'perimeter error',
 'area error',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst radius',
 'worst texture',
 'worst perimeter',
 'worst area',
 'worst smoothness',
 'worst compactness',
 'worst concavity',
 'worst concave points',
 'worst symmetry',
 'worst fractal dimension']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=2
)

In [15]:
X_test.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mean radius,171.0,14.322164,3.833332,7.691,11.53,13.43,16.3,28.11
mean texture,171.0,19.004971,3.968417,10.82,16.27,18.75,21.255,31.12
mean perimeter,171.0,93.372749,26.451695,48.34,73.935,86.18,107.75,188.5
mean area,171.0,679.450292,396.025382,170.4,408.2,556.7,826.95,2499.0
mean smoothness,171.0,0.098051,0.013478,0.06251,0.088705,0.09723,0.1068,0.1425
mean compactness,171.0,0.106835,0.056154,0.01938,0.06654,0.09509,0.1314,0.3114
mean concavity,171.0,0.09316,0.087672,0.0,0.02992,0.0594,0.1389,0.4264
mean concave points,171.0,0.05245,0.042552,0.0,0.020495,0.03438,0.079845,0.1913
mean symmetry,171.0,0.183357,0.029648,0.106,0.16375,0.1802,0.19775,0.304
mean fractal dimension,171.0,0.062937,0.007085,0.05054,0.05828,0.06201,0.06603,0.09744


In [16]:
X_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mean radius,398.0,14.043565,3.384142,6.981,11.7525,13.32,15.745,27.42
mean texture,398.0,19.41196,4.435474,9.71,16.1625,18.9,21.955,39.28
mean perimeter,398.0,91.36593,23.32234,43.79,75.5175,86.29,103.675,186.9
mean area,398.0,644.336432,331.143825,143.5,426.175,546.35,776.175,2501.0
mean smoothness,398.0,0.095634,0.014263,0.05263,0.08496,0.094625,0.104375,0.1634
mean compactness,398.0,0.103269,0.051346,0.0265,0.063945,0.092455,0.130375,0.3454
mean concavity,398.0,0.086926,0.076089,0.0,0.029565,0.06168,0.12705,0.4268
mean concave points,398.0,0.047402,0.037029,0.0,0.020323,0.033375,0.06825,0.2012
mean symmetry,398.0,0.180219,0.02638,0.1167,0.161525,0.17895,0.19455,0.2906
mean fractal dimension,398.0,0.062738,0.007058,0.04996,0.057563,0.0613,0.066143,0.09575


### Scaling data

$$ \frac{X-\mu}{\sigma}$$

In [17]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
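
The formula above can be checked by hand. The following is a minimal sketch, assuming the same split parameters as in the notebook (`test_size=0.3`, `random_state=2`); the names `mu`, `sigma`, `X_train_manual` and `X_test_manual` are introduced here for illustration only. It standardises both splits with the training mean and standard deviation and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

# The statistics are estimated on the training split only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same mu and sigma are applied to both splits
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

scaler = StandardScaler().fit(X_train)
same_train = np.allclose(X_train_manual, scaler.transform(X_train))
same_test = np.allclose(X_test_manual, scaler.transform(X_test))
print(same_train, same_test)
```

Because the test split reuses the training statistics, its columns will generally not have mean exactly 0 and standard deviation exactly 1. That is expected: fitting the scaler on the test data would leak information from it into preprocessing.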

# Without PCA

### 1. Modelling with KNN: k-nearest neighbors

"A man is known by the company he keeps."

<img src="https://s3.amazonaws.com/stackabuse/media/k-nearest-neighbors-algorithm-python-scikit-learn-2.png">
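
The idea behind the proverb can be sketched in a few lines of NumPy: a query point gets the majority label of its `k` closest training points. This is an illustrative toy version, not how scikit-learn implements `KNeighborsClassifier`; the helper `knn_predict` and the toy arrays are hypothetical names used only here. The default `k=5` mirrors the default `n_neighbors=5` of `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of one query point by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbours and return the most frequent one
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy data: two well-separated clusters with labels 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.95]), k=3)
print(pred_a, pred_b)
```

Since the vote depends entirely on distances, features with large numeric ranges (such as `mean area`) would dominate unscaled data, which is why the features are standardised before fitting kNN.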

In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [19]:
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

KNeighborsClassifier()

In [20]:
y_pred = knn.predict(X_test_scaled)

### 2. Evaluation

In [21]:
from sklearn.metrics import confusion_matrix

Confusion Matrix
![Screenshot%202021-07-20%20at%2012.56.29.png](attachment:Screenshot%202021-07-20%20at%2012.56.29.png)
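
As a hedged illustration of how the four cells of a binary confusion matrix are read, the sketch below builds one from small hand-made label arrays (`y_true_toy` and `y_pred_toy` are made up for this example) and derives accuracy, precision and recall from it. In scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels, for illustration only
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true_toy, y_pred_toy)
# For binary labels {0, 1}, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the points predicted as 1, how many are truly 1
recall = tp / (tp + fn)     # of the points that are truly 1, how many were found
print(cm)
print(accuracy, precision, recall)
```

Note that in the breast cancer dataset label 0 is malignant and label 1 is benign, so the "positive" class in this layout is benign. Misclassified malignant cases therefore sit in the top-right cell.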

# With PCA 

### 1. Applying PCA

In [22]:
from sklearn.decomposition import PCA

In [23]:
pca = PCA()
pca.fit(X_train_scaled)

PCA()

#### - Explained Variance Ratio

exp_var_pca = pca.explained_variance_ratio_

# Cumulative sum of the explained variance ratios; this is used to draw the step plot
# showing how much variance the first k principal components explain together.

cum_sum_eigenvalues = np.cumsum(exp_var_pca)

# Create the visualization plot
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, 
        alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, 
         where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

#### - Choose the number of dimensions

In [24]:
pca = PCA()
pca.fit(X_train_scaled)

PCA()
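
One common way to choose the number of dimensions is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a threshold of 95%, which is an illustrative choice and not a value taken from this notebook; `pca_full`, `pca_auto` and `n_components` are names introduced here. It computes the count by hand and also lets `PCA` pick it by passing a float between 0 and 1 as `n_components`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)
X_train_scaled = StandardScaler().fit(X_train).transform(X_train)

threshold = 0.95  # assumed target share of variance to retain

# Option 1: fit all components and read the count off the cumulative curve
pca_full = PCA().fit(X_train_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= threshold) + 1)

# Option 2: a float in (0, 1) asks PCA to keep enough components for that share
pca_auto = PCA(n_components=threshold).fit(X_train_scaled)

print(n_components, pca_auto.n_components_)
```

The threshold is a trade-off: a lower value gives a more compact representation, a higher one discards less information. It can also be tuned like any other hyperparameter.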

#### - Apply PCA on train and test data

In [25]:
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

### 2. Modelling with KNN: k-nearest neighbors

In [26]:
# Fit on the PCA-transformed training data so that it matches X_test_pca at prediction time
knn = KNeighborsClassifier()
knn.fit(X_train_pca, y_train)

KNeighborsClassifier()

In [27]:
y_pred_pca = knn.predict(X_test_pca)
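
Scaling, PCA and kNN can also be chained in a single scikit-learn pipeline, which guarantees that the classifier is fitted and evaluated in the same feature space. This is a sketch under assumptions: the 95% variance threshold passed to `PCA` is an illustrative choice, and `pipe`, `y_pred_pipe` and `cm_pipe` are names introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

# Each step is fitted on the training data only; predict() replays the
# fitted transforms on the test data before calling the classifier.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # assumed threshold, not taken from the notebook
    KNeighborsClassifier(),
)
pipe.fit(X_train, y_train)

y_pred_pipe = pipe.predict(X_test)
cm_pipe = confusion_matrix(y_test, y_pred_pipe)
print(cm_pipe)
```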

### 3. Evaluation and comparison with kNN without PCA

In [28]:
# without PCA: y_pred comes from the kNN fitted on the scaled features
confusion_matrix(y_test, y_pred)


array([[ 64,   3],
       [  0, 104]])

In [29]:
# with PCA
confusion_matrix(y_test, y_pred_pca)


### 4. PCA: data visualisation
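
One possible way to visualise the data is to project it onto the first two principal components and colour the points by class. The following is a minimal sketch, assuming the same split and scaling as earlier in the notebook; `pca_2d` and `X_train_2d` are names introduced here.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=2
)
X_train_scaled = StandardScaler().fit(X_train).transform(X_train)

# Keep only the two directions of largest variance
pca_2d = PCA(n_components=2).fit(X_train_scaled)
X_train_2d = pca_2d.transform(X_train_scaled)

fig, ax = plt.subplots(figsize=(6, 5))
for label, name in enumerate(cancer.target_names):
    mask = y_train == label
    ax.scatter(X_train_2d[mask, 0], X_train_2d[mask, 1], alpha=0.6, label=name)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.legend(loc="best")
fig.tight_layout()
```

With `%matplotlib inline` the figure is rendered at the end of the cell; in a plain script, add `plt.show()`.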

### 5. PCA: correlation between features in the original space and between features in a new space
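
A hedged sketch of this comparison: compute the correlation matrix of the scaled training features and of their principal-component scores, then look at the largest correlation between two different columns in each. The helper `max_off_diagonal` is a hypothetical name used only here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=2)
X_train_scaled = StandardScaler().fit_transform(X_train)
X_train_pca = PCA().fit_transform(X_train_scaled)

def max_off_diagonal(corr):
    """Largest absolute correlation between two different columns."""
    off_diag = corr - np.diag(np.diag(corr))
    return float(np.abs(off_diag).max())

# rowvar=False treats each column as one variable
corr_original = np.corrcoef(X_train_scaled, rowvar=False)
corr_pca = np.corrcoef(X_train_pca, rowvar=False)

print("original space:", max_off_diagonal(corr_original))
print("PCA space:     ", max_off_diagonal(corr_pca))
```

Several original features are strongly correlated (radius, perimeter and area measure closely related quantities), whereas the principal components are uncorrelated on the data they were fitted on, up to floating-point error. Either matrix can be passed to a heatmap function for a visual comparison.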

### 6. How the features mixed up to create the components
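
Each principal component is a linear combination of the original (scaled) features, and the weights are stored row by row in `pca.components_`. The sketch below is one way to inspect them, assuming three components for readability (an illustrative choice); `pca_3`, `loadings` and `top_pc1` are names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, _, y_train, _ = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=2
)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca_3 = PCA(n_components=3).fit(X_train_scaled)  # 3 is an assumed, illustrative value

# components_ has shape (n_components, n_features); transpose so that
# rows are original features and columns are principal components
loadings = pd.DataFrame(
    pca_3.components_.T,
    index=cancer.feature_names,
    columns=[f"PC{i + 1}" for i in range(pca_3.n_components_)],
)

# Features that weigh most heavily (in absolute value) in the first component
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```

Each column of `loadings` is a unit vector, so the squared weights in a column sum to 1 and can be read as the relative contribution of each feature to that component. The sign of a whole component is arbitrary.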