# Principal Component Analysis using the Cancer Data

## Introduction
Principal component analysis (PCA) explains the variance-covariance structure of the attributes (also called variables or features) in a dataset using a few linear combinations of those variables. It usually serves as an intermediate step toward larger objectives such as classification, regression analysis, cluster analysis, or factor analysis.

This assignment walks through a principal component analysis using NumPy library functions. Other Python libraries can achieve the same result; one example is provided towards the end.

In [1]:
# By default, a Jupyter notebook cell displays only the value of its last expression.
# This setting makes the notebook display the value of every expression in the cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

## Loading the data

To illustrate PCA, we will use the Breast Cancer Wisconsin (Diagnostic) data available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

The data was collected in 1995 and has 569 observations, each with 30 features or attributes. There are no missing attribute values. There are two classes: Malignant (212 observations, about 37%) and Benign (357 observations, about 63%).

You can read more about the data at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

For this analysis, the files used are wdbc.data (the observations) and wdbc.names (for the column headers).

In [None]:
# Load the data and check that it loaded correctly by printing the shape and the first/last few rows

missing_value_formats = ["n.a.","?","NA","n/a", "na", "--", " "]
colnames = ['ID', 'Diagnosis', 'mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension']
df_raw = pd.read_csv('wdbc.data', header = None, names = colnames, na_values = missing_value_formats)
df_raw.shape
df_raw.head(5)
df_raw.tail(5)

## EDA

### Descriptive Statistics
If there are missing values, outliers, format mismatches, erroneous observations, or attributes not relevant to the analysis, we might have to repeat this section several times until we are confident in the accuracy of the dataset. This dataset does not have any missing values or anomalous observations of concern.

Let's compute summary statistics and information on data types.

In [None]:
df_raw.describe()

In [None]:
df_raw.info()

### Data visualization

This section illustrates *some* of the data visualization techniques that can help us judge whether PCA would be helpful for this data set.

The first is to check whether the 30 attributes vary visibly between the two categories of cancer. This is a quick check, since this data set has been used to classify tumors as malignant or benign using features of the tumor. So we want to see whether the attributes differ between the two categories (we will talk more about this during Regression Analysis later in the course).

The second is to check whether the attributes are correlated. If the variables are uncorrelated (i.e., correlations are 0 or close to 0), the principal components would essentially be the original attributes and PCA would offer no dimension reduction.
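
A numeric complement to the scatter matrix is to count how many attribute pairs are strongly correlated. The sketch below runs on synthetic data rather than the cancer data; the helper name `count_correlated_pairs` and the 0.8 cutoff are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def count_correlated_pairs(X, threshold=0.8):
    """Count attribute pairs whose absolute correlation exceeds threshold.

    Hypothetical helper for illustration; columns of X are attributes.
    """
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)   # each pair once, diagonal excluded
    return int(np.sum(np.abs(corr[upper]) > threshold))

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Three columns driven by the same signal, plus two independent columns
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    -base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 2)),
])
n_pairs = count_correlated_pairs(X)
print(n_pairs)
```

With the notebook's data, the same helper could be applied to `df_raw.iloc[:, 2:32].to_numpy()`. Many strongly correlated pairs suggest that a few components may summarize the attributes well.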

In [None]:
# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html
# colnames[2:] skips the first two columns, ID and Diagnosis, so that
# one box plot per attribute is drawn, grouped by Diagnosis.

for col in colnames[2:]:
    df_raw.boxplot(column=col, by='Diagnosis', figsize=(12, 9))


In [None]:
# Reference : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html

# Exclude ID and Diagnosis from the scatter plot matrix
scatter_matrix(df_raw.iloc[:,2:32], alpha=0.9, figsize=(50, 50), diagonal='kde')

#### Observation
Quite a few variables appear to differ between the two categories, and many of them are correlated. There is scope for reducing the dimension of the attribute set from 30 to a smaller number.

### Dataset preparation
Since PCA is an unsupervised approach (the analysis does not depend on the labels, categories, or response to predict), we remove the ID and Diagnosis columns.

In [None]:
df = df_raw.iloc[:,2:32]

## Principal Component Analysis
The analysis consists of five main steps:
1. Computing the covariance matrix. The covariance matrix is a square symmetric matrix with the variances of the attributes on the diagonal and the pairwise covariances among attributes on the off-diagonal. Dividing each covariance by the product of the two corresponding standard deviations (so that every diagonal entry becomes 1) gives the correlation matrix. This is the same as the covariance matrix of the standardized data matrix (centered and scaled, i.e. subtract the mean and divide by the standard deviation).
2. Factoring the covariance (or correlation) matrix into eigenvalues and eigenvectors.
3. Scree plot. This plots the component number against the proportion of variance explained by the corresponding eigenvalue. It is a useful visual aid for deciding how many components to keep: we look for a bend (elbow) in the plot.
4. Computing the principal components  
5. Interpretation
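
The five steps can be sketched end to end on a small synthetic data set. This is a minimal illustration under stated assumptions, not the notebook's analysis: the data, the variable names, and the choice of `k = 2` are made up, and `np.linalg.eigh` is used here because the matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: 200 observations, 4 attributes on very different scales
latent = rng.normal(size=(200, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 0] + 0.3 * rng.normal(size=200),
    100 * latent[:, 1],
    100 * latent[:, 1] + 30 * rng.normal(size=200),
])

# Step 1: correlation matrix, computed as the covariance of standardized data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.cov(Z, rowvar=False)

# Step 2: eigen-decomposition; eigh returns ascending order, so sort descending
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: proportion of variance explained (the values shown in a scree plot)
pve = eigvals / eigvals.sum()

# Step 4: principal component scores for the first k components
k = 2
scores = Z @ eigvecs[:, :k]

# Step 5: interpretation starts from the weights in each eigenvector column
print(np.round(pve, 3))
print(np.round(eigvecs[:, :k], 3))
```

The sample variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues measure the variance captured by each component.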



### Covariance Matrix

In [None]:
df_cov = df.cov()
df_cov

In [None]:
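# Reference: https://numpy.org/doc/stable/reference/routines.linalg.html
# w holds the eigenvalues; column v[:, i] is the eigenvector paired with w[i].
# np.linalg.eig does not guarantee any particular ordering of the eigenvalues.
w, v = np.linalg.eig(df_cov)

print(w)
#print(v)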
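
A few quick checks can confirm that an eigen-decomposition behaves as PCA expects. The sketch below uses a synthetic covariance matrix (not the cancer data) and `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order; the explicit sort puts the largest eigenvalue first.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with correlated columns
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column v of eigvecs should satisfy S v = lambda v
residual = S @ eigvecs - eigvecs * eigvals
# The eigenvectors of a symmetric matrix should be orthonormal
gram = eigvecs.T @ eigvecs
# The eigenvalues should add up to the total variance (trace of S)
total_matches = np.isclose(eigvals.sum(), np.trace(S))

print(np.abs(residual).max())
print(np.allclose(gram, np.eye(5)), total_matches)
```

Because the eigenvalues add up to the total variance, dividing each one by their sum gives the proportion of variance explained.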

In [None]:
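# Proportion of variance explained (PVE) by each component, and its running total
PVE = w/w.sum()
PVE
np.cumsum(PVE)

fig = plt.figure(figsize=(8,5))

# Number the components from 1 on the x-axis; the y-axis shows PVE, not raw eigenvalues
plt.plot(np.arange(1, len(PVE) + 1), PVE, linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')

plt.show()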

In [None]:
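# Column k of v pairs with eigenvalue w[k]. Indexing is zero-based,
# so v[:,0] pairs with w[0] and v[:,1] pairs with w[1].
print(v[:,1])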

##### First Principal Component

(0.00929)*mean radius + (-0.00288)*mean texture + (0.06275)*mean perimeter + (0.85182)*mean area + (-0.00001)*mean smoothness + (-0.00000)*mean compactness + (0.00008)*mean concavity + (0.00005)*mean concave points + (-0.00003)*mean symmetry + (-0.00002)*mean fractal dimension + (-0.00005)*radius error + (0.00035)*texture error + (0.00082)*perimeter error + (0.00751)*area error + (0.00000)*smoothness error + (0.00001)*compactness error + (0.00003)*concavity error + (0.00001)*concave points error + (0.00001)*symmetry error + (0.00000)*fractal dimension error + (-0.00057)*worst radius + (-0.01322)*worst texture + (-0.00019)*worst perimeter + (-0.51974)*worst area + (-0.00008)*worst smoothness + (-0.00026)*worst compactness + (-0.00018)*worst concavity + (-0.00003)*worst concave points + (-0.00016)*worst symmetry + (-0.00006)*worst fractal dimension

##### Observation
From the scree plot (component number vs PVE), we can see that the first component captures about 98% of the variability. However, a closer look at the first principal component shows that it is dominated by the variables with large variances and magnitudes, 'mean area' and 'worst area'. We can correct this by using the correlation matrix to compute the principal components.
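
The dominance of large-variance attributes is easy to reproduce. In this sketch, under the assumption of synthetic and mutually independent attributes, one column has a spread 500 times larger than the others, and the first eigenvector of the covariance matrix points almost entirely along it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
small = rng.normal(scale=1.0, size=(n, 3))     # three attributes with unit-scale spread
large = rng.normal(scale=500.0, size=(n, 1))   # one attribute with a much larger spread
X = np.hstack([small, large])

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
first = eigvecs[:, np.argmax(eigvals)]         # eigenvector of the largest eigenvalue
pve_first = eigvals.max() / eigvals.sum()

print(np.round(np.abs(first), 4))
print(round(pve_first, 4))
```

The attributes here are independent, so there is no shared structure to find; the large first component only reflects the units of the last column.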

### Correlation Matrix (or Covariance matrix of Standardized Data)
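
The equivalence named in this heading can be checked numerically. The sketch below uses synthetic data with attributes on very different scales; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Four attributes whose scales differ by factors of ten
X = rng.normal(size=(150, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
X[:, 1] += 5 * X[:, 0]   # introduce some correlation between two attributes

# Standardize: subtract the mean and divide by the sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(Z, rowvar=False)
corr = np.corrcoef(X, rowvar=False)
same = np.allclose(cov_of_standardized, corr)
print(same)
```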

In [None]:
df_corr = df.corr()
df_corr

In [None]:
w_, v_ = np.linalg.eig(df_corr)

print(w_)
#print(v_)

In [None]:
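# Determining the number of principal components
PVE_ = w_/w_.sum()
PVE_
np.cumsum(PVE_)

fig = plt.figure(figsize=(8,5))

# Number the components from 1 on the x-axis; the y-axis shows PVE, not raw eigenvalues
plt.plot(np.arange(1, len(PVE_) + 1), PVE_, linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.show()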

In [None]:
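# The eigenvectors of the correlation matrix are in v_; column k pairs with w_[k].
# Indexing is zero-based, so this slice shows columns 1 to 13.
print(v_[:,1:14])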

##### First few principal components
First Principal Component:
(-0.23386)*mean radius + (-0.05971)*mean texture + (-0.21518)*mean perimeter + (-0.23108)*mean area + (0.18611)*mean smoothness + (0.15189)*mean compactness + (0.06017)*mean concavity + (-0.03477)*mean concave points + (0.19035)*mean symmetry + (0.36658)*mean fractal dimension + (-0.10555)*radius error + (0.08998)*texture error + (-0.08946)*perimeter error + (-0.15229)*area error + (0.20443)*smoothness error + (0.23272)*compactness error + (0.19721)*concavity error + (0.13032)*concave points error + (0.18385)*symmetry error + (0.28009)*fractal dimension error + (-0.21987)*worst radius + (-0.04547)*worst texture + (-0.19988)*worst perimeter + (-0.21935)*worst area + (0.17230)*worst smoothness + (0.14359)*worst compactness + (0.09796)*worst concavity + (-0.00826)*worst concave points + (0.14188)*worst symmetry + (0.27534)*worst fractal dimension

Second Principal Component:
(-0.00853)*mean radius + (0.06455)*mean texture + (-0.00931)*mean perimeter + (0.02870)*mean area + (-0.10429)*mean smoothness + (-0.07409)*mean compactness + (0.00273)*mean concavity + (-0.02556)*mean concave points + (-0.04024)*mean symmetry + (-0.02257)*mean fractal dimension + (0.26848)*radius error + (0.37463)*texture error + (0.26665)*perimeter error + (0.21601)*area error + (0.30884)*smoothness error + (0.15478)*compactness error + (0.17646)*concavity error + (0.22466)*concave points error + (0.28858)*symmetry error + (0.21150)*fractal dimension error + (-0.04751)*worst radius + (-0.04230)*worst texture + (-0.04855)*worst perimeter + (-0.01190)*worst area + (-0.25980)*worst smoothness + (-0.23608)*worst compactness + (-0.17306)*worst concavity + (-0.17034)*worst concave points + (-0.27131)*worst symmetry + (-0.23279)*worst fractal dimension

Third Principal Component:
(0.04141)*mean radius + (-0.60305)*mean texture + (0.04198)*mean perimeter + (0.05343)*mean area + (0.15938)*mean smoothness + (0.03179)*mean compactness + (0.01912)*mean concavity + (0.06534)*mean concave points + (0.06712)*mean symmetry + (0.04859)*mean fractal dimension + (0.09794)*radius error + (-0.35986)*texture error + (0.08899)*perimeter error + (0.10821)*area error + (0.04466)*smoothness error + (-0.02747)*compactness error + (0.00132)*concavity error + (0.07407)*concave points error + (0.04407)*symmetry error + (0.01530)*fractal dimension error + (0.01542)*worst radius + (-0.63281)*worst texture + (0.01380)*worst perimeter + (0.02589)*worst area + (0.01765)*worst smoothness + (-0.09133)*worst compactness + (-0.07395)*worst concavity + (0.00601)*worst concave points + (-0.03625)*worst symmetry + (-0.07705)*worst fractal dimension

Fourth Principal Component:
(-0.03779)*mean radius + (0.04947)*mean texture + (-0.03737)*mean perimeter + (-0.01033)*mean area + (0.36509)*mean smoothness + (-0.01170)*mean compactness + (-0.08638)*mean concavity + (0.04386)*mean concave points + (0.30594)*mean symmetry + (0.04442)*mean fractal dimension + (0.15446)*radius error + (0.19165)*texture error + (0.12099)*perimeter error + (0.12757)*area error + (0.23207)*smoothness error + (-0.27997)*compactness error + (-0.35398)*concavity error + (-0.19555)*concave points error + (0.25287)*symmetry error + (-0.26330)*fractal dimension error + (0.00441)*worst radius + (0.09288)*worst texture + (-0.00745)*worst perimeter + (0.02739)*worst area + (0.32444)*worst smoothness + (-0.12180)*worst compactness + (-0.18852)*worst concavity + (-0.04333)*worst concave points + (0.24456)*worst symmetry + (-0.09442)*worst fractal dimension


Fifth Principal Component:
(0.01874)*mean radius + (-0.03218)*mean texture + (0.01731)*mean perimeter + (-0.00189)*mean area + (-0.28637)*mean smoothness + (-0.01413)*mean compactness + (-0.00934)*mean concavity + (-0.05205)*mean concave points + (0.35646)*mean symmetry + (-0.11943)*mean fractal dimension + (-0.02560)*radius error + (-0.02875)*texture error + (0.00181)*perimeter error + (-0.04286)*area error + (-0.34292)*smoothness error + (0.06920)*compactness error + (0.05634)*concavity error + (-0.03122)*concave points error + (0.49025)*symmetry error + (-0.05320)*fractal dimension error + (-0.00029)*worst radius + (-0.05001)*worst texture + (0.00850)*worst perimeter + (-0.02516)*worst area + (-0.36926)*worst smoothness + (0.04771)*worst compactness + (0.02838)*worst concavity + (-0.03087)*worst concave points + (0.49893)*worst symmetry + (-0.08022)*worst fractal dimension

##### Observation
For standardized data, no variable dominates the principal components merely because of its scale, since every variable has unit variance in the correlation matrix. We can replace the original data matrix with the first 14 principal components, which cumulatively explain about 98% of the total sample variance. Hence we have reduced the dimension from 30 attributes to 14 with little loss of information.
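
Instead of reading the number of components off the cumulative sums by eye, it can be computed for a chosen threshold. The helper name `n_components_for` and the PVE values below are hypothetical; the function assumes the PVE values are sorted from largest to smallest.

```python
import numpy as np

def n_components_for(pve, threshold):
    """Smallest number of leading components whose cumulative PVE reaches threshold.

    Hypothetical helper; assumes pve is sorted in descending order.
    """
    cumulative = np.cumsum(pve)
    # searchsorted gives the first index where cumulative >= threshold
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(pve))   # guard against rounding at the top end

# Illustrative PVE values, not taken from the cancer data
pve = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
print(n_components_for(pve, 0.95))
print(n_components_for(pve, 0.70))
```

In the notebook this could be called as `n_components_for(PVE_, 0.98)`, provided the eigenvalues in `w_` are in descending order.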

Regarding interpretation, we can deduce from the weights that the first principal component seems to be a contrast between attributes capturing size and other features such as smoothness, compactness, and symmetry. The second principal component could be interpreted as a weighted difference between the "error" and the "worst" versions of the attributes. The third principal component captures the "texture"-related features.
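
Reading 30 weights per component by eye is error-prone, so it can help to rank the attributes by absolute weight. This is a small sketch with a hypothetical helper `top_loadings` and made-up attribute names and weights.

```python
import numpy as np

def top_loadings(component, names, k=3):
    """Return the k (name, weight) pairs with the largest absolute weights.

    Hypothetical helper for illustration.
    """
    order = np.argsort(np.abs(component))[::-1][:k]
    return [(names[i], float(component[i])) for i in order]

# Made-up attribute names and weights, for illustration only
names = ["size_a", "size_b", "texture_a", "smoothness_a"]
component = np.array([0.10, -0.05, -0.70, 0.02])

top = top_loadings(component, names, k=2)
print(top)
```

In the notebook, `component` could be a column of `v_` and `names` the 30 column names of `df`.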

Principal components derived from the covariance matrix are different from the ones derived from the correlation matrix. If the original dataset has attributes on varying scales, it is recommended to standardize the data matrix or use the correlation matrix.

## Alternate Libraries

This data is also available as part of the scikit-learn library.
Scikit-learn is a machine learning library for Python. The library provides seven datasets for learning the various approaches. You can read about the different datasets and a tutorial on loading them here:
https://towardsdatascience.com/how-to-use-scikit-learn-datasets-for-machine-learning-d6493b38eca3


For additional reading on scikit-learn: https://scikit-learn.org/stable/index.html

In [None]:
import sklearn.datasets
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [None]:
cancer = sklearn.datasets.load_breast_cancer()
# The object returned is a Bunch, a dictionary-like object whose keys are also accessible as attributes
cancer.data.shape
colnames = cancer.feature_names
print(cancer.DESCR)

# Convert the Bunch to a pandas DataFrame for ease of manipulation
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

In [None]:
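# Standardize only the 30 feature columns; the 'target' label is not an input to PCA
features = df[list(cancer.feature_names)]
scaler = preprocessing.StandardScaler().fit(features)
df_scaled = pd.DataFrame(scaler.transform(features), columns=features.columns)

#df_scaled.mean()
#df_scaled.std()

# Equivalent one-step form: fit and transform together
df_stand = pd.DataFrame(preprocessing.StandardScaler().fit_transform(features), columns=features.columns)
#df_stand.cov()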

In [None]:
pca = PCA(n_components=30)
pca.fit(df_scaled)
X_pca = pca.transform(df_scaled)
# Check the shape of the X_pca array and the cumulative proportion of variance explained
print("shape of X_pca", X_pca.shape)
print(np.cumsum(pca.explained_variance_ratio_))

# Reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
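
As a cross-check between the two approaches, the proportions of variance explained reported by scikit-learn on standardized features should match those from the eigenvalues of the correlation matrix. The sketch below assumes scikit-learn is installed and chains the scaler and PCA with `make_pipeline`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data   # features only, without the target labels

# Standardize, then run PCA keeping all components
pipeline = make_pipeline(StandardScaler(), PCA())
scores = pipeline.fit_transform(X)
ratio_sklearn = pipeline.named_steps["pca"].explained_variance_ratio_

# Eigenvalues of the correlation matrix, largest first
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
ratio_eig = eigvals / eigvals.sum()

print(scores.shape)
print(np.allclose(ratio_sklearn, ratio_eig))
```

The signs of individual components may differ between the two approaches, since an eigenvector multiplied by -1 is still an eigenvector; the proportions of variance are unaffected.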