## Diego Orejuela
## Machine Learning Project

## Perform dimensionality reduction using PCA, LDA, and Kernel PCA.
Dataset: the breast cancer dataset that scikit-learn provides.
* Number of Instances: 569
* Number of Attributes: 30
* Number of classes: 2
* Read the user guide to learn about the dataset. The dataset can be loaded with:
    
    from sklearn.datasets import load_breast_cancer 
    
    df = load_breast_cancer()


## Load the dataset and split it into a training set (70%) and a test set (30%).

In [1]:
#Loading the dataset
from sklearn.datasets import load_breast_cancer 
dataset = load_breast_cancer()

X = dataset.data
y = dataset.target

#split data into Training and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

#Standardizing the features (zero mean and unit variance, using statistics from the training set only)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
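The split and scaling step above can be checked with a short sketch. It assumes the same 70/30 split and `random_state=0` as the cell above; the variable names `scaler`, `X_train_scaled`, and `X_test_scaled` are illustrative and not part of the original notebook. The point it illustrates is that the scaler learns its mean and standard deviation from the training set only, so only the training features are exactly standardized.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training data only, then reuse its statistics
# for the test data so no test information leaks into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training set shape:", X_train_scaled.shape)
print("Test set shape:", X_test_scaled.shape)
# Training features have (numerically) zero mean and unit variance.
print("Train means ~ 0:", np.allclose(X_train_scaled.mean(axis=0), 0.0, atol=1e-8))
print("Train stds ~ 1:", np.allclose(X_train_scaled.std(axis=0), 1.0))
```

The test features are transformed with the training statistics, so their means and variances will generally be close to, but not exactly, 0 and 1.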

## Train Logistic Regression on the dataset and time how long it takes.

In [2]:
import time
start_time_lr = time.time()

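# Fitting Logistic Regression to Training Set
# Note: the import below runs inside the timed region, so the measured time
# includes the one-time import cost as well as the fit itself
from sklearn.linear_model import LogisticRegression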
classifierObj = LogisticRegression(random_state=0)
classifierObj.fit(X_train, y_train)

# Execution time of logistic regression
elapsed_time_lr = time.time() - start_time_lr
print("Logistic Regression training took: ", elapsed_time_lr, "seconds \n")


Logistic Regression training took:  0.7268662452697754 seconds 
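As a possible refinement of the timing cell above, the sketch below keeps all imports outside the timed region and uses `time.perf_counter()`, which is intended for measuring short durations. This is an alternative under stated assumptions, not the measurement the notebook actually reports; the names `classifier` and `elapsed_fit` are illustrative, and the elapsed time will vary by machine.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

classifier = LogisticRegression(random_state=0)

# Only the call to fit() sits between the two clock readings.
start = time.perf_counter()
classifier.fit(X_train_scaled, y_train)
elapsed_fit = time.perf_counter() - start

print("Logistic Regression fit took:", elapsed_fit, "seconds")
```

For very short fits like this one, repeating the measurement several times and reporting the median would give a steadier number than a single run.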



## Evaluate the resulting model on the test set.

In [3]:
# Making Predictions on the Test Set
y_pred_lr = classifierObj.predict(X_test)

#Evaluate predictions using a Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
print("Confusion Matrix \n", cm_lr, "\n")

#model accuracy
lr_ModAcc = ((cm_lr.diagonal().sum()/cm_lr.sum()))
print('Model Accuracy: ', lr_ModAcc*100)

#misclassification rate
lr_MiscRate = 1 - lr_ModAcc
print("Misclassification Rate: ", lr_MiscRate*100)

Confusion Matrix 
 [[ 60   3]
 [  1 107]] 

Model Accuracy:  97.6608187134503
Misclassification Rate:  2.3391812865497075
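The accuracy formula used above (sum of the confusion matrix diagonal divided by the sum of all entries) gives the same value as `sklearn.metrics.accuracy_score`. The sketch below shows this on a small, made-up label vector; `y_true_demo` and `y_pred_demo` are hypothetical values chosen for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Correct predictions lie on the diagonal.
acc_from_cm = cm_demo.diagonal().sum() / cm_demo.sum()
acc_from_sklearn = accuracy_score(y_true_demo, y_pred_demo)

print(cm_demo)
print("Accuracy from confusion matrix:", acc_from_cm)
print("Accuracy from accuracy_score:  ", acc_from_sklearn)
```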


## Use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of at least 95%.

In [4]:
#Applying PCA
from sklearn.decomposition import PCA

# initially use n_components=None to keep all components
pcaObj = PCA(n_components = None) 

X_train_PCA = pcaObj.fit_transform(X_train)
X_test_PCA = pcaObj.transform(X_test)
# explained_variance_ratio_ holds the fraction of total variance explained by each principal component, in descending order
components_variance_PCA = pcaObj.explained_variance_ratio_

#variance captured with chosen number of components
print('Explained variance ratio of at least 95%:')
print(sum(components_variance_PCA[0:10]*100))

#Applying PCA with 95% explained variance ratio, 10 components
pcaObj = PCA(n_components = 10) 
X_train_PCA = pcaObj.fit_transform(X_train)
X_test_PCA = pcaObj.transform(X_test)
components_variance_PCA = pcaObj.explained_variance_ratio_
components_variance_PCA

Explained variance ratio of at least 95%:
95.14149471124793


array([0.43689315, 0.19415163, 0.09661545, 0.06716611, 0.0549883 ,
       0.04012257, 0.02183068, 0.01489226, 0.01374108, 0.01101371])
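Instead of inspecting the cumulative variance by hand, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below, which assumes the same split and scaling as earlier cells, shows both routes side by side; names such as `pca_95` and `n_needed` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Route 1: fit all components, then find the first index at which
# the cumulative explained variance reaches 95%.
pca_full = PCA(n_components=None).fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

# Route 2: let PCA choose the number of components for a 95% target.
pca_95 = PCA(n_components=0.95).fit(X_train_scaled)

print("Components needed (manual):", n_needed)
print("Components kept by PCA(0.95):", pca_95.n_components_)
print("Variance retained:", pca_95.explained_variance_ratio_.sum())
```

The second route avoids hard-coding the component count, which helps if the split or the preprocessing changes later.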

## Train a new Logistic Regression classifier on the PCA reduced dataset and time how long it takes. Was training much faster?

In [5]:
import time
start_time_PCA = time.time()

classifierObj_PCA = LogisticRegression(random_state=0)
classifierObj_PCA.fit(X_train_PCA, y_train)

# Execution time of logistic regression
elapsed_time_PCA = time.time() - start_time_PCA
print("Logistic Regression with PCA took: ", elapsed_time_PCA , " seconds \n")

Logistic Regression with PCA took:  0.002975940704345703  seconds 



## Next evaluate the classifier on the test set: how does it compare to the previous classifier?

In [6]:
#Making Predictions on the Test Set
y_pred_PCA = classifierObj_PCA.predict(X_test_PCA)

#Evaluate predictions using a Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_PCA = confusion_matrix(y_test, y_pred_PCA)
print("Confusion Matrix \n", cm_PCA, "\n")

PCA_ModAcc = ((cm_PCA.diagonal().sum()/cm_PCA.sum()))
print('Model Accuracy: ', PCA_ModAcc*100)
#misclassification rate
PCA_MiscRate = 1 - PCA_ModAcc
print("Misclassification Rate: ", PCA_MiscRate*100)

Confusion Matrix 
 [[ 60   3]
 [  4 104]] 

Model Accuracy:  95.90643274853801
Misclassification Rate:  4.093567251461994


## Use LDA to reduce the dataset’s dimensionality down to 2 linear discriminants.

In [7]:
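#Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# LDA yields at most min(n_features, n_classes - 1) discriminants; with 2 classes that is 1,
# and recent scikit-learn versions raise a ValueError if n_components=2 is requested
ldaObj = LDA(n_components=1) 
# fit_transform takes both X_train and y_train because LDA is supervised
X_train_LDA = ldaObj.fit_transform(X_train, y_train)
X_test_LDA = ldaObj.transform(X_test)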
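The number of linear discriminants LDA can produce is bounded by `min(n_features, n_classes - 1)`. The following sketch computes that bound for this dataset and uses it as `n_components`; it assumes the same split and scaling as earlier cells, and the names `max_components` and `X_train_lda` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Upper bound on the number of discriminants LDA can extract.
n_classes = np.unique(y_train).size
max_components = min(X_train_scaled.shape[1], n_classes - 1)

lda = LinearDiscriminantAnalysis(n_components=max_components)
# LDA is supervised, so the class labels are passed to fit_transform.
X_train_lda = lda.fit_transform(X_train_scaled, y_train)

print("Classes:", n_classes)
print("Maximum discriminants:", max_components)
print("Reduced training shape:", X_train_lda.shape)
```

With two classes the reduced data has a single column, so the logistic regression trained on it works with one feature.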

## Train a new Logistic Regression classifier on the LDA reduced dataset and time how long it takes.

In [8]:
start_time_LDA = time.time()
# Fitting Logistic Regression to Training Set
classifierObj_LDA = LogisticRegression(random_state=0)
classifierObj_LDA.fit(X_train_LDA, y_train)

# Execution time of logistic regression
elapsed_time_LDA = time.time() - start_time_LDA
print("Logistic Regression with LDA took: ", elapsed_time_LDA, "seconds \n")

Logistic Regression with LDA took:  0.0023632049560546875 seconds 



## Evaluate the classifier on the test set.

In [9]:
#Making Predictions on the Test Set
y_pred_LDA = classifierObj_LDA.predict(X_test_LDA)

#Evaluate predictions using a Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_LDA = confusion_matrix(y_test, y_pred_LDA)
print("Confusion Matrix \n", cm_LDA, "\n")

LDA_ModAcc = ((cm_LDA.diagonal().sum()/cm_LDA.sum()))
print('Model Accuracy: ', LDA_ModAcc*100)
#misclassification rate
LDA_MiscRate = 1 - LDA_ModAcc
print("Misclassification Rate: ", LDA_MiscRate*100)

Confusion Matrix 
 [[ 59   4]
 [  2 106]] 

Model Accuracy:  96.49122807017544
Misclassification Rate:  3.508771929824561


## Use Kernel PCA to reduce the dataset’s dimensionality down to 2 features.

In [10]:
# kernelPCA
from sklearn.decomposition import KernelPCA
kernelPCAObj = KernelPCA(n_components=2, kernel='rbf')
X_train_kPCA = kernelPCAObj.fit_transform(X_train)
X_test_kPCA = kernelPCAObj.transform(X_test)
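The cell above leaves the RBF kernel coefficient `gamma` at its default. According to the scikit-learn documentation, `gamma=None` is treated as `1 / n_features`. The sketch below makes that default explicit and checks that both settings give the same projection. It assumes the same split and scaling as earlier cells, and it sets `eigen_solver="dense"` (an assumption not in the original cell) so that the two fits are deterministic and directly comparable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train_scaled = StandardScaler().fit_transform(X_train)
n_features = X_train_scaled.shape[1]

# Default gamma (None) versus the documented equivalent 1 / n_features.
kpca_default = KernelPCA(n_components=2, kernel="rbf", eigen_solver="dense")
kpca_explicit = KernelPCA(
    n_components=2, kernel="rbf", gamma=1.0 / n_features, eigen_solver="dense"
)

Z_default = kpca_default.fit_transform(X_train_scaled)
Z_explicit = kpca_explicit.fit_transform(X_train_scaled)

# Compare absolute values so a sign flip of a component would not matter.
same = np.allclose(np.abs(Z_default), np.abs(Z_explicit), atol=1e-6)
print("Reduced shape:", Z_default.shape)
print("Default gamma matches 1 / n_features:", same)
```

Other `gamma` values change the kernel width and therefore the projection, so trying a few values (for example via cross-validation) may be worthwhile before drawing conclusions about Kernel PCA on this dataset.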

## Train a new Logistic Regression classifier on the Kernel PCA reduced dataset and time how long it takes.

In [11]:
start_time_kPCA = time.time()

# Fitting Logistic Regression to Training Set
classifierObj_kPCA = LogisticRegression(random_state=0)
classifierObj_kPCA.fit(X_train_kPCA, y_train)

# Execution time of logistic regression
elapsed_time_kPCA = time.time() - start_time_kPCA
print("Logistic Regression with Kernel PCA took: ", elapsed_time_kPCA, " seconds \n")


Logistic Regression with Kernel PCA took:  0.0019910335540771484  seconds 



## Evaluate the classifier on the test set.

In [12]:
#Making Predictions on the Test Set
y_pred_kPCA = classifierObj_kPCA.predict(X_test_kPCA)

#Evaluate predictions using a Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_kPCA = confusion_matrix(y_test, y_pred_kPCA)
print("Confusion Matrix \n", cm_kPCA, "\n")

kPCA_ModAcc = ((cm_kPCA.diagonal().sum()/cm_kPCA.sum()))
print('Model Accuracy: ', kPCA_ModAcc*100)
#misclassification rate
kPCA_MiscRate = 1 - kPCA_ModAcc
print("Misclassification Rate: ", kPCA_MiscRate*100)

Confusion Matrix 
 [[56  7]
 [10 98]] 

Model Accuracy:  90.05847953216374
Misclassification Rate:  9.941520467836263


## Conclusions:

* Logistic Regression without any dimensionality reduction was the most accurate predictor, with a 2.3% misclassification rate, but it took the longest to train (about 0.73 seconds). That figure also includes the one-time import of `LogisticRegression`, which runs inside the timed cell, so it overstates the fitting time.
* Training on the PCA-reduced dataset was much faster (about 0.003 seconds) than training on the full dataset, but the misclassification rate increased to 4.1%.
* Training on the LDA-reduced dataset took a similar amount of time to PCA (about 0.002 seconds) and resulted in a 3.5% misclassification rate.
* Kernel PCA reduction increased the misclassification rate to 9.9%.
* Training times with PCA, LDA and Kernel PCA were similar (roughly 0.002 to 0.003 seconds).
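The misclassification rates quoted in the conclusions can be recomputed directly from the confusion matrices printed in the evaluation cells. The sketch below hard-codes those four matrices as reported above; the dictionary name `reported` and the labels used as keys are illustrative.

```python
import numpy as np

# Confusion matrices exactly as printed in the evaluation cells above.
reported = {
    "No reduction": np.array([[60, 3], [1, 107]]),
    "PCA": np.array([[60, 3], [4, 104]]),
    "LDA": np.array([[59, 4], [2, 106]]),
    "Kernel PCA": np.array([[56, 7], [10, 98]]),
}

errors = {}
rates = {}
for name, cm in reported.items():
    total = cm.sum()
    # Off-diagonal entries are the misclassified test samples.
    errors[name] = int(total - cm.diagonal().sum())
    rates[name] = 100.0 * errors[name] / total
    print(f"{name:>12}: {errors[name]:2d} of {total} misclassified ({rates[name]:.1f}%)")
```

Every matrix sums to the same 171 test samples, so the rates are directly comparable across the four models.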