# **Feature Extraction and Dimensionality Reduction with Principal Component Analysis (PCA), and an Accuracy Comparison of 6 Machine Learning Classification Models Before and After PCA**

Step 1: Collect Data: UCI Parkinson's Disease Classification Data Set

Step 2: Eigendecomposition - Eigenvalues, Eigenvectors and Eigenspace

Step 3: Principal Component Selection

Step 4: Projection onto a New Feature Space

Step 5: Principal Component Analysis (PCA)

Step 6: Comparison of the Accuracy of 6 Machine Learning Models: before and after PCA

1. Model : Logistic Regression
2. Model : Support Vector Machines (SVM)
3. Model : Decision Tree Classifier
4. Model : KNN (k-nearest neighbors algorithm)
5. Model : Random Forest Classifier
6. Model: Gaussian Naive Bayes


In [None]:
import os 
os.chdir("../input/parkinsons-disease-speech-signal-features/")
!ls

# **Step 1: Collect Data: UCI Parkinson's Disease Classification Data Set**
https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification

In [None]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00470/pd_speech_features.rar


In [None]:
# !unrar x pd_speech_features.rar

**Data Set Information:**

The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 (65.1±10.9) at the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1±8.9). During the data collection process, the microphone is set to 44.1 kHz and following the physician's examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.


**Attribute Information:**

Various speech signal processing algorithms, including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TQWT (tunable Q-factor wavelet transform) features, have been applied to the speech recordings of Parkinson's Disease (PD) patients to extract clinically useful information for PD assessment.


In [None]:
import pandas as pd
df = pd.read_csv("pd_speech_features.csv") # import dataset 

df

**Data Cleaning and Data Manipulation**

In [None]:
# df.columns = df.iloc[0]
# df = df.iloc[1:,].reindex()
# df

In [None]:
df.columns

In [None]:
df.info()

**Determining dependent and independent variables of the dataset**

In [None]:
X = df.iloc[:, 0:754].values  # select the independent variables
y = df.iloc[:, 754].values    # select the dependent variable and target column
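
As a hedged alternative to positional slicing, the features and the target can be selected by column name, which avoids hard-coding the column count. The sketch below uses a tiny made-up DataFrame; only the `class` column name comes from this notebook, while `id`, `feat_a` and `feat_b` are hypothetical placeholders.

```python
import pandas as pd

# Toy frame standing in for the real data; all values and the
# "id", "feat_a", "feat_b" column names are illustrative assumptions.
toy = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "feat_a": [0.1, 0.2, 0.3, 0.4],
    "feat_b": [1.0, 0.9, 0.8, 0.7],
    "class": [1, 1, 0, 0],
})

# Everything except the target column becomes the feature matrix
X_toy = toy.drop(columns=["class"]).values
# The target column is selected by name
y_toy = toy["class"].values

print(X_toy.shape, y_toy.shape)
```

Note that selecting everything except the target keeps any identifier column among the features, exactly as the positional slice `0:754` does; whether such a column should be dropped is a separate modelling decision.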

**Data Standardization**


In [None]:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

# **Step 2: Eigendecomposition - Eigenvalues, Eigenvectors and Eigenspace**
The eigenvectors and eigenvalues of a covariance (or correlation) matrix are the core of PCA: the eigenvectors (the principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues describe the variance of the data along the new feature axes.

**Covariance Matrix**

The classical approach to PCA is to perform an eigendecomposition of the covariance matrix, a matrix in which each element is the covariance between two features. The covariance between two features is calculated as follows:

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$


**Compute the covariance matrix**

In [None]:
import numpy as np

X_mean = np.mean(X, axis=0)
# cov_mat = np.cov(X.T)  # another method; np.cov expects variables in rows, hence the transpose
cov_mat = (X - X_mean).T.dot((X - X_mean)) / (X.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)

**A second method for computing the covariance matrix**

In [None]:
print('NumPy covariance matrix: \n%s' %np.cov(X.T))
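
The two ways of computing the covariance matrix can be checked against each other. The following is a minimal sketch on a small random matrix (synthetic values, not the Parkinson's data); the names `X_demo`, `cov_manual` and `cov_numpy` are introduced here only for illustration.

```python
import numpy as np

# Small synthetic matrix: 5 samples (rows) x 3 features (columns)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))

# Manual covariance: center each column, then (Xc^T Xc) / (N - 1)
X_centered = X_demo - X_demo.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (X_demo.shape[0] - 1)

# np.cov treats rows as variables by default, so pass rowvar=False
# (equivalent to transposing the input)
cov_numpy = np.cov(X_demo, rowvar=False)

covariances_agree = np.allclose(cov_manual, cov_numpy)
print(covariances_agree)
```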

**Compute the Eigenvalues and Eigenvectors**
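We perform an eigendecomposition of the covariance matrix. Three routes are commonly used: eigendecomposition of the covariance matrix after standardizing the data, eigendecomposition of the correlation matrix, and eigendecomposition of the correlation matrix after standardizing the data. All three yield the same eigenvectors; the eigenvalues are also the same, up to a constant factor N/(N−1) when the data are standardized with the population standard deviation while the covariance uses N−1 in the denominator.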

In [None]:
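# The covariance matrix is symmetric, so np.linalg.eigh is the appropriate solver:
# it returns real eigenvalues in ascending order and the eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_mat)

# Show the five largest eigenvalues and their eigenvectors (one eigenvector per printed row)
print('Eigenvectors \n%s' % eigenvectors[:, -5:][:, ::-1].T)
print('\nEigenvalues \n%s' % eigenvalues[-5:][::-1])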

In [None]:
len(eigenvalues)
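
To make the relationship between the covariance and correlation routes concrete, here is a small sketch on synthetic data (an assumption for illustration, not the notebook's data set). It standardizes with the population standard deviation, as `StandardScaler` does, and checks both the N/(N−1) factor and the defining property `C v = λ v` of each eigenpair.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 50 samples x 4 features, with two features made correlated
A = rng.normal(size=(50, 4))
A[:, 1] += 0.8 * A[:, 0]

# Standardize with the population std (ddof=0), as StandardScaler does
Z = (A - A.mean(axis=0)) / A.std(axis=0)

cov_z = np.cov(Z, rowvar=False)        # sample covariance, denominator N - 1
corr = np.corrcoef(A, rowvar=False)    # correlation matrix of the raw data
n = A.shape[0]

# Covariance of the standardized data equals the correlation matrix times N / (N - 1)
scale_ok = np.allclose(cov_z, corr * n / (n - 1))

# eigh is intended for symmetric matrices and returns real eigenvalues
vals, vecs = np.linalg.eigh(cov_z)

# Each column v of vecs should satisfy cov_z @ v = lambda * v
residual = np.abs(cov_z @ vecs - vecs * vals).max()
print(scale_ok, residual)
```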

# **Step 3: Principal Component Selection**
**Sorting Eigenpairs**

The purpose of PCA is to reduce the dimensionality of the original feature space by projecting it onto a smaller subspace whose axes are the eigenvectors. However, the eigenvectors only define the directions of the new axes, since they all have unit length. To decide which eigenvectors can be dropped without losing too much information, we examine the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, so those are the ones to drop. The common approach is to rank the eigenvalues from highest to lowest and keep the top k eigenvectors.

**Compute the explained variance of each eigenvalue**
We keep only the first 6 principal components for this project.

In [None]:
total_of_eigenvalues = sum(eigenvalues)
varariance = [(i / total_of_eigenvalues)*100 for i in sorted(eigenvalues, reverse=True)]

varariance[:50]

**As the figure below shows, the components beyond roughly the 350th explain close to 0% of the variance, so dropping them loses almost no information.**

In [None]:

import matplotlib.pyplot as plt

with plt.style.context('dark_background'):
    plt.figure(figsize=(15, 10))

    plt.bar(range(len(eigenvalues)), varariance, alpha=0.8, align='center',
            label='individual explained variance')
    plt.ylabel('Explained variance (%)')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()

In [None]:
varariance[0]
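
Instead of fixing the number of components in advance, one option is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The helper below is a hypothetical sketch (`components_for_variance` is not part of this notebook or of any library), shown on four made-up eigenvalues.

```python
import numpy as np

def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest k such that the k largest eigenvalues explain at least `threshold` of the total."""
    # Sort magnitudes from largest to smallest
    vals = np.sort(np.abs(np.asarray(eigenvalues)))[::-1]
    # Cumulative share of the total variance
    cumulative = np.cumsum(vals) / vals.sum()
    # Index of the first cumulative share that reaches the threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(vals))

# Made-up eigenvalues: cumulative shares are 0.4, 0.7, 0.9, 1.0
demo_eigenvalues = [1.0, 4.0, 2.0, 3.0]
k_80 = components_for_variance(demo_eigenvalues, threshold=0.8)
print(k_80)  # 3
```

With the notebook's own variables, the call would be `components_for_variance(eigenvalues, 0.95)`; the threshold itself is a judgment call.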

**Projection Matrix** 
The projection matrix is used to transform the input data (X) into the new feature subspace. It is built by concatenating the top k eigenvectors as columns. Here, we reduce the 754-dimensional feature space to a 6-dimensional feature subspace by selecting the 6 eigenvectors with the highest eigenvalues to construct our 754 × 6 projection matrix W.

In [None]:
eigenpairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]

# Sorting eigenvalues and eigenvectors from higher values to lower values
eigenpairs.sort(key=lambda x: x[0], reverse=True)

eigenpairs[0][0]

In [None]:
eigenpairs[5][1].shape

In [None]:
# Stack the 6 leading eigenvectors as columns to form the 754 x 6 projection matrix W
n_components = 6
matrix_weighing = np.column_stack([eigenpairs[i][1] for i in range(n_components)])
matrix_weighing

# **Step 4: Projection onto a New Feature Space**

In this step, we use the 754 × 6 projection matrix W to transform our samples onto the new six-dimensional subspace via the equation Y = X × W.

In [None]:
Y = X.dot(matrix_weighing)
Y.shape
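
The manual projection Y = X × W should agree with scikit-learn's `PCA` up to the sign of each component, because an eigenvector and its negative describe the same axis. The sketch below checks this on synthetic, centered data (an assumption for illustration; the variable names are not from this notebook).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic data: 40 samples x 5 features, mixed so the features are correlated
X_demo = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))
# PCA centers the data internally, so center here as well
X_demo = X_demo - X_demo.mean(axis=0)

# Manual route: eigendecomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(vals)[::-1]      # indices of eigenvalues, largest first
W = vecs[:, order[:2]]              # 5 x 2 projection matrix
Y_manual = X_demo @ W

# Library route
Y_sklearn = PCA(n_components=2).fit_transform(X_demo)

# Columns may differ by a factor of -1, so compare absolute values
same_up_to_sign = np.allclose(np.abs(Y_manual), np.abs(Y_sklearn))
print(same_up_to_sign)
```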

In [None]:
df["class"].unique()

In [None]:
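import matplotlib.pyplot as plt

# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in newer Matplotlib releases
style = ('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available
         else 'seaborn-whitegrid')

with plt.style.context(style):
    plt.figure(figsize=(6, 4))
    # Iterate over the actual label values so the mask works whether y holds ints or strings
    for lab, col in zip(np.unique(y), ('red', 'green')):
        plt.scatter(Y[y == lab, 0], Y[y == lab, 1], label=str(lab), c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()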

# **Step 5: Principal Component Analysis (PCA)**

In [None]:
from sklearn.decomposition import PCA
pca = PCA().fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlim(0, 754)  # xlim takes only the left and right limits
plt.grid()
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
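
Rather than reading the number of components off the cumulative curve, scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components` and then keeps as many components as are needed to explain that share of the variance. A minimal sketch on synthetic data follows; the 0.95 threshold and the column scales are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic data: 100 samples x 10 features with very different variances
scales = np.array([10, 5, 3, 1, 1, 0.5, 0.5, 0.1, 0.1, 0.1])
X_demo = rng.normal(size=(100, 10)) * scales

# A float n_components selects the number of components by explained variance;
# svd_solver='full' is set explicitly because this mode relies on the full SVD
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

n_kept = pca_95.n_components_
explained = pca_95.explained_variance_ratio_.sum()
print(n_kept, round(explained, 3))
```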

**Division of training and test data**

In [None]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

**Standard scaling of the data**

In [None]:
# Standardize the data: fit the scaler on the training set, then apply it to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print("X_train shape = ",X_train.shape)
print("X_test shape = ",X_test.shape)

**Preparing the new data set used to train the models: PCA is applied for feature extraction, which reduces the dimensionality of the data set.**

**Note:** I chose the top 6 components with the highest variance (`n_components = 6`). This choice is arbitrary, and a different number can be used. Keeping only 6 principal components instead of all 754 features greatly reduces the size of the data and speeds up processing, while retaining the directions of greatest variance.

**With PCA, only 6 variables are processed instead of 754.**

In [None]:
# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 6)

X_train2 = pca.fit_transform(X_train) # fit a single PCA on the training data so both sets share the same space
X_test2 = pca.transform(X_test)       # do not fit on the test data; only apply the transform

print("X_train2 shape = ",X_train2.shape)
print("X_test2 shape = ",X_test2.shape)
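
The scale, reduce, then classify sequence can also be wrapped in a scikit-learn `Pipeline`, which guarantees that the scaler and the PCA are fitted on the training data only. The sketch below is one possible arrangement on synthetic data from `make_classification` (not the Parkinson's data); `max_iter=1000` is an assumption added to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 samples, 50 features
X_demo, y_demo = make_classification(n_samples=300, n_features=50,
                                     n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Scaling and PCA are fitted inside the pipeline, on the training split only
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=6),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# pipe[:-1] is the preprocessing part (scaler + PCA) without the classifier
reduced = pipe[:-1].transform(X_te)
test_accuracy = pipe.score(X_te, y_te)
print(reduced.shape, test_accuracy)
```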


# **Step 6: Comparison of the Accuracy of 6 Machine Learning Models: before and after PCA**

## **1. Model : Logistic Regression**

**Before PCA**

In [None]:
# Logistic regression before the PCA transformation
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)

**After PCA**

In [None]:

# Logistic regression after the PCA transformation
classifier2 = LogisticRegression(random_state=0)
classifier2.fit(X_train2,y_train)

**Success comparison of PCA and non-PCA models**

In [None]:
# Predictions
y_pred = classifier.predict(X_test)    # without PCA
y_pred2 = classifier2.predict(X_test2) # after PCA

**Comparison between real and before PCA**

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn import neighbors, datasets, preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# actual labels vs. predictions without PCA
print("Comparison between real and before PCA")

print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred))
print('Classification \n', classification_report(y_test, y_pred))

**Comparison between real and after PCA**

In [None]:
# actual labels vs. predictions after PCA
print("Comparison between real and after PCA ")

print('Accuracy Score:', accuracy_score(y_test, y_pred2))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred2))
print('Classification \n', classification_report(y_test, y_pred2))



## **2. Model : Support Vector Machines (SVM)**

**Comparison between real and before PCA**

In [None]:

#Support Vector Machine
from sklearn.svm import SVC
 

classifier = SVC()
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
accuracy = accuracy_score(y_test,y_pred)
print("Support Vector Machine:")

print("Comparison between real and before PCA")

print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred))
print('Classification \n', classification_report(y_test, y_pred))

**Comparison between real and after PCA**

In [None]:

#Support Vector Machine
from sklearn.svm import SVC
 

classifier = SVC()
classifier.fit(X_train2,y_train)
y_pred2 = classifier.predict(X_test2)
cm = confusion_matrix(y_test,y_pred2)
accuracy = accuracy_score(y_test,y_pred2)
print("Support Vector Machine:")

print("Comparison between real and after PCA")

print('Accuracy Score:', accuracy_score(y_test, y_pred2))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred2))
print('Classification \n', classification_report(y_test, y_pred2))

## **3. Model : Decision Tree Classifier**

**Comparison between real and before PCA**

In [None]:

from sklearn.model_selection import train_test_split


from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

classifier = DT(criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Decision Tree Classifier :")

print("Comparison between real and before PCA")

print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred))
print('Classification \n', classification_report(y_test, y_pred))

**Comparison between real and after PCA**

In [None]:

from sklearn.model_selection import train_test_split


from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

classifier = DT(criterion='entropy', random_state=0)
classifier.fit(X_train2,y_train)
y_pred2 = classifier.predict(X_test2)


print("Comparison between real and after PCA")

print('Accuracy Score:', accuracy_score(y_test, y_pred2))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred2))
print('Classification \n', classification_report(y_test, y_pred2))

## **4. Model : KNN (k-nearest neighbors algorithm)**

**Comparison between real and before PCA**

In [None]:

from sklearn import neighbors, datasets, preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("K-Neighbors Classifier :")

print("Comparison between real and before PCA")

print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred))
print('Classification \n', classification_report(y_test, y_pred))

**Comparison between real and after PCA**

In [None]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train2, y_train)
y_pred2 = knn.predict(X_test2)

print("Comparison between real and after PCA")
print('Accuracy Score:', accuracy_score(y_test, y_pred2))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred2))
print('Classification \n', classification_report(y_test, y_pred2))

## **5. Model : Random Forest Classifier**

**Comparison between real and before PCA**

In [None]:

from sklearn.ensemble import RandomForestClassifier as RF

classifier = RF(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Random Forest Classifier :")

print("Comparison between real and before PCA")
print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred))
print('Classification \n', classification_report(y_test, y_pred))

**Comparison between real and after PCA**

In [None]:

from sklearn.ensemble import RandomForestClassifier as RF

classifier = RF(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train2,y_train)
y_pred2 = classifier.predict(X_test2)

print("Comparison between real and after PCA")
print('Accuracy Score:', accuracy_score(y_test, y_pred2))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred2))
print('Classification \n', classification_report(y_test, y_pred2))

## **6. Model : Gaussian Naive Bayes**

**Comparison between real and before PCA**

In [None]:

#Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Gaussian Naive Bayes :")

print("Comparison between real and before PCA")
print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred))
print('Classification \n', classification_report(y_test, y_pred))

**Comparison between real and after PCA**

In [None]:

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train2,y_train)
y_pred2 = classifier.predict(X_test2)

print("Comparison between real and after PCA")
print('Accuracy Score:', accuracy_score(y_test, y_pred2))
print('Confusion matrix \n',  confusion_matrix(y_test, y_pred2))
print('Classification \n', classification_report(y_test, y_pred2))
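
The six before/after comparisons above can also be collected in a single loop, which makes the accuracies easier to read side by side. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the Parkinson's data set. The model settings mirror those used above, except `max_iter=1000` for logistic regression, which is an assumption added to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Parkinson's data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=40,
                                     n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Fit the scaler and the PCA on the training split only
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
pca = PCA(n_components=6).fit(X_tr)
X_tr_pca, X_te_pca = pca.transform(X_tr), pca.transform(X_te)

# Factories, so a fresh model is created for the "before" and the "after" run
model_factories = {
    "Logistic Regression": lambda: LogisticRegression(random_state=0, max_iter=1000),
    "SVM": lambda: SVC(),
    "Decision Tree": lambda: DecisionTreeClassifier(criterion="entropy", random_state=0),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
    "Random Forest": lambda: RandomForestClassifier(n_estimators=10, criterion="entropy",
                                                    random_state=0),
    "Gaussian Naive Bayes": lambda: GaussianNB(),
}

results = {}
for name, make_model in model_factories.items():
    acc_before = accuracy_score(y_te, make_model().fit(X_tr, y_tr).predict(X_te))
    acc_after = accuracy_score(y_te, make_model().fit(X_tr_pca, y_tr).predict(X_te_pca))
    results[name] = (acc_before, acc_after)

for name, (acc_before, acc_after) in results.items():
    print(f"{name:22s} before PCA: {acc_before:.3f}   after PCA: {acc_after:.3f}")
```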

**It is observed that reducing the number of variables with PCA has a positive effect on the accuracy of some of the machine learning classification models. By accepting a small loss of information, it is possible to obtain faster and, in some cases, more effective solutions. Dimensionality reduction with PCA is especially convenient in studies involving Big Data.**