## PCA and LDA

### Dataset: Pima Indians Diabetes Database

This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases.

Diabetes is one of the fastest-growing chronic, life-threatening diseases. According to a 2018 World Health Organization (WHO) report, it has already affected 422 million people worldwide. Because the disease has a relatively long asymptomatic phase, early detection is always desirable for a clinically meaningful outcome. For the same reason, around 50% of all people with diabetes are undiagnosed.

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

There are 768 observations and 8 independent variables in the dataset. The target variable indicates the test result of the patient. It is 1 when the test result is positive and 0 when the test result is negative. 

### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import missingno as mno
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#ignore warning messages 
import warnings
warnings.filterwarnings('ignore') 

### Loading the dataset

In [2]:
location=("C:\\Users\\krishna meghana\\Downloads\\diabetes.csv")
data = pd.read_csv(location)

In [3]:
data.head()

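|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |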


In [4]:
print("The shape of the data is ",data.shape)

The shape of the data is  (768, 9)


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


The diabetes dataset contains 768 rows and 9 columns, including the target variable, and all of them are numeric. There are no categorical features, although some, such as Glucose, BMI, and Age, could be binned into categories.

Outcome is the target variable which has values 0 and 1 indicating whether the patient has diabetes or not.

#### Description of the Attributes:

- Pregnancies --- Number of times pregnant
- Glucose --- The blood plasma glucose concentration after a 2 hour oral glucose tolerance test (mg/dL)
- BloodPressure --- Diastolic blood pressure (mm Hg)
- SkinThickness --- Triceps skinfold thickness (mm)
- Insulin --- 2 Hour serum insulin (mu U/ml)
- BMI --- Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction --- A function that determines the risk of type 2 diabetes based on family history, the larger the function, the higher the risk of type 2 diabetes.
- Age --- Age of the patient (years)
- Outcome --- Whether the person is diagnosed with type 2 diabetes (1 = yes, 0 = no)

The dataset has nine attributes: eight independent variables (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age) and one dependent variable (Outcome).

### Checking for missing values

In [6]:
data.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [7]:
data.describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


#### There are no missing values in the dataset, but many features contain 0 values.

For features such as BloodPressure, BMI, Glucose, and Insulin, a value of 0 makes no sense: a patient cannot have a blood pressure or BMI of 0.

It appears that zero was used as a placeholder for missing values in the original data. We can treat these zeros as missing and impute them with suitable values.

Replacing the zeros with NaN first makes them easy to count and easy to fill in afterwards.
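Before converting anything, it can help to count how many zeros each affected column holds. The sketch below does this on the first five rows shown by `data.head()` above; the name `zero_invalid` is introduced here only for illustration. Pregnancies is left out on purpose, because 0 is a legitimate value for that column.

```python
import pandas as pd

# First five rows of the dataset, copied from the data.head() output above.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a zero cannot be a real measurement (illustrative name).
zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Boolean mask of zeros, summed column by column.
zero_counts = (sample[zero_invalid] == 0).sum()
print(zero_counts)
```

Running the same two lines on the full `data` frame should give the counts that later show up as NaNs.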

In [8]:
# Replace zeros with NaN in the columns where zero is not a valid measurement
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
zero_invalid = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']

data_copy = data.copy(deep = True)
data_copy[zero_invalid] = data_copy[zero_invalid].replace(0, np.nan)

# Show the count of NaNs per column

print(data_copy.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64


We can see that the features Glucose, BloodPressure, SkinThickness, Insulin, BMI have missing values.

In [9]:
#Percentage of NaNs in each column

round(data_copy.isnull().sum()/len(data_copy)*100,2)

Pregnancies                  0.00
Glucose                      0.65
BloodPressure                4.56
SkinThickness               29.56
Insulin                     48.70
BMI                          1.43
DiabetesPedigreeFunction     0.00
Age                          0.00
Outcome                      0.00
dtype: float64

The Insulin column has almost 50% of its values missing. It will be hard to impute this many missing values reliably for model building.

In [10]:
#For easy access, renaming data_copy as df
df=data_copy


The missing values can be imputed with the mean, the median, or some other value, or the affected rows can be dropped from the dataset. Here, we replace them with the mean or the median, depending on the distribution of each column.

In [11]:
#Checking the skewness of data

df.skew()

Pregnancies                 0.901674
Glucose                     0.530989
BloodPressure               0.134153
SkinThickness               0.690619
Insulin                     2.166464
BMI                         0.593970
DiabetesPedigreeFunction    1.919911
Age                         1.129597
Outcome                     0.635017
dtype: float64

For skewed columns we impute with the median; otherwise we impute with the mean.

Glucose, BloodPressure, SkinThickness, Insulin, BMI are the columns with missing values.
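The rule above can be sketched as a small helper. The notebook does not state a cutoff for "highly skewed", so the `threshold=1.0` used here is an assumption, and `impute_by_skew` is a hypothetical name. The toy frame is synthetic and only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd

def impute_by_skew(frame, columns, threshold=1.0):
    """Fill NaNs with the median when |skew| > threshold, else with the mean.

    The threshold is an assumed cutoff, not one taken from the notebook.
    """
    filled = frame.copy()
    chosen = {}
    for col in columns:
        use_median = abs(filled[col].skew()) > threshold
        value = filled[col].median() if use_median else filled[col].mean()
        filled[col] = filled[col].fillna(value)
        chosen[col] = "median" if use_median else "mean"
    return filled, chosen

# Synthetic example: one symmetric column and one with a large outlier.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 2.0, 2.0, 100.0, np.nan],
})
filled, chosen = impute_by_skew(toy, ["symmetric", "skewed"])
print(chosen)
print(filled)
```

With a cutoff of 1.0 and the skewness values printed above, only Insulin (2.17) among the columns with missing values would be filled with the median, so the choice of cutoff changes which columns are treated as skewed.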

In [12]:
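# Treated as skewed: impute with the median
# (assignment is used instead of inplace=True on a column, which newer pandas deprecates)
df['BMI'] = df['BMI'].fillna(df['BMI'].median())
df['SkinThickness'] = df['SkinThickness'].fillna(df['SkinThickness'].median())
df['Insulin'] = df['Insulin'].fillna(df['Insulin'].median())

# Treated as approximately normal: impute with the mean
df['Glucose'] = df['Glucose'].fillna(df['Glucose'].mean())
df['BloodPressure'] = df['BloodPressure'].fillna(df['BloodPressure'].mean())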


In [13]:
df.describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 121.686763 | 30.435949 | 44.0 | 99.75 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 72.405184 | 12.096346 | 24.0 | 64.0 | 72.202592 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 29.108073 | 8.791221 | 7.0 | 25.0 | 29.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 140.671875 | 86.38306 | 14.0 | 121.5 | 125.0 | 127.25 | 846.0 |
| BMI | 768.0 | 32.455208 | 6.875177 | 18.2 | 27.5 | 32.3 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


Missing values have been handled.

In [14]:
df.duplicated().sum()

0

There are no duplicate rows in the data.

There are no categorical variables in the data except the target variable which is Outcome. 
It has the values 0 and 1 which represent if a patient is non-diabetic or diabetic.

In [15]:
#Splitting the dataset into X and y
X=df.drop(columns = ['Outcome'])
y=df['Outcome']

In [16]:
# splitting the data into testing and training data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## PCA

Principal component analysis, or PCA, is a statistical technique that converts high-dimensional data to low-dimensional data. It does not select a subset of the original features. Instead, it builds new features, the principal components, as linear combinations of the original ones, chosen to capture as much of the variance in the data as possible. PCA is unsupervised: it looks only at the feature set and ignores the output.

The direction along which the data varies most is the first principal component. The direction with the second-highest variance, orthogonal to the first, is the second principal component, and so on.
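To make the idea concrete, here is a minimal sketch that computes principal components by hand from the eigen-decomposition of the covariance matrix and checks the result against scikit-learn. The data is synthetic (generated with an assumed mixing matrix), not the diabetes data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: three correlated columns built from two latent factors.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# 1. Center the data and compute the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigen-decompose; eigh returns ascending eigenvalues, so reverse the order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Each eigenvalue is the variance along its component.
ratio = eigvals / eigvals.sum()

# 4. Project the centered data onto the components (linear combinations of columns).
scores = Xc @ eigvecs

# Compare with scikit-learn's PCA on the same data.
sk_ratio = PCA().fit(X).explained_variance_ratio_
print(ratio)
print(sk_ratio)
```

Every component mixes all of the original columns, which is why a principal component cannot be read as "one of the features".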

PCA is sensitive to the scale of the features, because features with larger numeric ranges contribute more variance. So we standardize the feature set with StandardScaler before applying PCA.
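The effect of scaling can be seen in a small experiment. The sketch below uses two independent synthetic features on very different scales (the means and spreads are made-up values, not measurements from the dataset) and compares the explained variance ratio with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features with very different spreads.
small = rng.normal(loc=0.5, scale=0.3, size=500)
large = rng.normal(loc=80.0, scale=100.0, size=500)
X = np.column_stack([small, large])

# Without scaling, the large-range feature dominates the first component.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# After standardization, both features contribute on an equal footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On unscaled data the first component is essentially the large-range feature on its own; after scaling, the variance is split roughly evenly, as expected for independent features.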

In [17]:
#Scaling the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Because the number of components is not specified, all eight principal components are returned.

In [18]:
from sklearn.decomposition import PCA # Importing PCA
pca = PCA() # No number of components specified
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

The fitted PCA object exposes explained_variance_ratio_, which holds the fraction of the total variance explained by each principal component.

#### Variance

In [19]:
explained_variance = pca.explained_variance_ratio_
explained_variance

array([0.28111397, 0.19037219, 0.1458825 , 0.11425287, 0.09651666,
       0.06736909, 0.0577707 , 0.04672202])

The first principal component explains about 28% of the variance in the data, and the second explains about 19%. Together, the first two principal components capture roughly 47% of the total variance. This is variance in the features, not classification information, because PCA never looks at the target.
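A cumulative sum makes it easier to see how much variance the first k components retain. The sketch below reuses the ratios printed above; the 90% target used to pick a number of components is an assumed, illustrative cutoff.

```python
import numpy as np

# Explained variance ratios copied from the output above.
explained_variance = np.array([0.28111397, 0.19037219, 0.1458825, 0.11425287,
                               0.09651666, 0.06736909, 0.0577707, 0.04672202])

# Running total of variance retained by the first k components.
cumulative = np.cumsum(explained_variance)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.1%}")

# Smallest k whose cumulative ratio reaches the assumed 90% target.
n_needed = int(np.argmax(cumulative >= 0.90) + 1)
print("components needed for 90%:", n_needed)
```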

Using one principal component:

In [20]:
pca = PCA(n_components=1) #Using 1 principal component
X_train_1 = pca.fit_transform(X_train)
X_test_1 = pca.transform(X_test)

In [21]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train_1, y_train)
y_pred = classifier.predict(X_test_1) # Predicting test set results

In [22]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('\nAccuracy: ' + str(accuracy_score(y_test, y_pred)))


[[99  8]
 [26 21]]

Accuracy: 0.7792207792207793


With only one principal component, the random forest algorithm correctly predicts 120 out of 154 test instances, resulting in 77.9% accuracy.
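The accuracy can be read directly off the confusion matrix: the diagonal holds the correct predictions. The sketch below recomputes it from the matrix printed above and also derives the per-class recall, which accuracy alone hides. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class 0/1, columns: predicted class 0/1).
cm = np.array([[99, 8],
               [26, 21]])

correct = np.trace(cm)   # correctly classified instances (diagonal)
total = cm.sum()         # all test instances
accuracy = correct / total

# Unpack in scikit-learn's order for a binary problem: tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
recall_positive = tp / (tp + fn)   # share of diabetic patients detected
recall_negative = tn / (tn + fp)   # share of non-diabetic patients detected

print(f"accuracy: {accuracy:.4f} ({correct} of {total})")
print(f"recall, class 1: {recall_positive:.4f}")
print(f"recall, class 0: {recall_negative:.4f}")
```

For this matrix, most of the errors are diabetic patients predicted as non-diabetic (26 of 47), which is worth keeping in mind when comparing models by accuracy only.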

In [28]:
pca = PCA(n_components=2) #Using 2 principal components
X_train_2 = pca.fit_transform(X_train)
X_test_2 = pca.transform(X_test)

In [29]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train_2, y_train)
y_pred = classifier.predict(X_test_2) # Predicting test set results

In [30]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('\nAccuracy: ' + str(accuracy_score(y_test, y_pred)))

[[98  9]
 [29 18]]

Accuracy: 0.7532467532467533


With two principal components, the random forest algorithm correctly predicts 116 out of 154 test instances, resulting in 75.3% accuracy.

In [31]:
pca = PCA(n_components=3) #Using 3 principal components
X_train_3 = pca.fit_transform(X_train)
X_test_3 = pca.transform(X_test)

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train_3, y_train)
y_pred = classifier.predict(X_test_3) # Predicting test set results

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('\nAccuracy: ' + str(accuracy_score(y_test, y_pred)))

[[105   2]
 [ 38   9]]

Accuracy: 0.7402597402597403


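With three principal components, the random forest algorithm correctly predicts 114 out of 154 test instances, resulting in 74.0% accuracy.

Here, the accuracy decreases as more principal components are added: 77.9%, 75.3%, and 74.0% for one, two, and three components. On this train/test split, the first principal component alone gives the best result with this shallow random forest. The differences amount to only a few test instances, so they should be read with caution.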
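The three experiments above repeat the same cells with a different `n_components`. A loop over a scikit-learn pipeline expresses the same comparison more compactly. This is a minimal sketch on synthetic data from `make_classification`, used as a stand-in with the same shape as the diabetes data (an assumption), so its scores are not comparable to the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

scores = {}
for k in range(1, 9):
    # Scaling and PCA are fitted on the training split only, inside the pipeline.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        RandomForestClassifier(max_depth=2, random_state=0),
    )
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)  # test accuracy

for k, acc in scores.items():
    print(f"{k} component(s): accuracy {acc:.3f}")
```

Keeping the scaler and PCA inside the pipeline guarantees that the test split is only ever transformed, never used for fitting.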

## LDA


Linear discriminant analysis, or LDA, reduces the dimensions of the feature set while retaining the information that discriminates between the output classes. Unlike PCA, it relies on the output labels, so it is a supervised technique. With two output classes, LDA can produce at most one linear discriminant.
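The number of discriminants LDA can produce is bounded by the number of classes minus one (and by the number of features), so a binary target allows only one. The sketch below shows this on synthetic two-class data from `make_classification`, which is an assumption for illustration and not the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with 8 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

# LDA uses the labels during fitting, unlike PCA.
lda = LinearDiscriminantAnalysis(n_components=1)
projected = lda.fit_transform(X_demo, y_demo)
print(projected.shape)

# Asking for more components than (number of classes - 1) is rejected.
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X_demo, y_demo)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print("ValueError:", err)
```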

In [24]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1) #Number of components = 1
X_train_l = lda.fit_transform(X_train, y_train)
X_test_l = lda.transform(X_test)
LDA_df = pd.DataFrame(X_test_l)
LDA_df.head()


|   | 0 |
|---|---|
| 0 | 3.105905 |
| 1 | -0.656733 |
| 2 | -1.297629 |
| 3 | 1.240923 |
| 4 | -0.999044 |


In [25]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train_l, y_train)
y_pred2 = classifier.predict(X_train_l)
y_pred = classifier.predict(X_test_l)

In [26]:
#Train accuracy
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_train, y_pred2)
print(cm)
print('\nAccuracy: ' + str(accuracy_score(y_train, y_pred2)))

[[358  35]
 [107 114]]

Accuracy: 0.7687296416938111


In [27]:
#Test accuracy
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('\nAccuracy: ' + str(accuracy_score(y_test, y_pred)))

[[100   7]
 [ 22  25]]

Accuracy: 0.8116883116883117


__With one linear discriminant, the model achieves a test accuracy of 81.17%. This is higher than the 77.9% achieved with one principal component.__

