# CS 4661: Introduction to Data Science
## Jay Tandel


### Problem: Predicting Heart Disease
In this question, we work with a dataset from the textbook "An Introduction to Statistical Learning."

## A
Read the data file “Heart_s.csv” (from GitHub, using the following command) and assign it to a pandas DataFrame:

df = pd.read_csv("https://github.com/mpourhoma/CS4661/raw/master/Heart_s.csv")

In [1]:
# Importing the required packages and libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [2]:
# Read the CSV file from the web and store it in a pandas DataFrame
df = pd.read_csv("https://github.com/mpourhoma/CS4661/raw/master/Heart_s.csv")

## B
Check out the dataset. As you see, the dataset contains a number of features including both contextual and biological factors (e.g. age, gender, vital signs, …). The last column “AHD” is the label with “Yes” meaning that a human subject has Heart Disease, and “No” meaning that the subject does not have Heart Disease.

In [3]:
#print the imported dataset
print(df)

     Age Gender     ChestPain  RestBP  Chol  RestECG  MaxHR  Oldpeak  \
0     63      f       typical     145   233        2    150      2.3   
1     67      f  asymptomatic     160   286        2    108      1.5   
2     67      f  asymptomatic     120   229        2    129      2.6   
3     37      f    nonanginal     130   250        0    187      3.5   
4     41      m    nontypical     130   204        2    172      1.4   
..   ...    ...           ...     ...   ...      ...    ...      ...   
296   45      f       typical     110   264        0    132      1.2   
297   68      f  asymptomatic     144   193        0    141      3.4   
298   57      f  asymptomatic     130   131        0    115      1.2   
299   57      m    nontypical     130   236        2    174      0.0   
300   38      f    nonanginal     138   175        0    173      0.0   

           Thal  AHD  
0         fixed   No  
1        normal  Yes  
2    reversable  Yes  
3        normal   No  
4        normal   No

## C
As you see, there are at least 3 categorical features in the dataset (Gender, ChestPain, Thal). For now, ignore these categorical features, keep only the numerical features, and build your feature matrix and label vector.

In [4]:
# Creating the Feature Matrix for the dataset:

# create a Python list of the feature names we would like to pick from the dataset:
feature_cols = ['Age','RestBP','Chol','RestECG','MaxHR','Oldpeak']

# use the above list to select the features from the original DataFrame
X = df[feature_cols] 


print(X)

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak
0     63     145   233        2    150      2.3
1     67     160   286        2    108      1.5
2     67     120   229        2    129      2.6
3     37     130   250        0    187      3.5
4     41     130   204        2    172      1.4
..   ...     ...   ...      ...    ...      ...
296   45     110   264        0    132      1.2
297   68     144   193        0    141      3.4
298   57     130   131        0    115      1.2
299   57     130   236        2    174      0.0
300   38     138   175        0    173      0.0

[301 rows x 6 columns]


In [5]:
# select label (the last column) from the DataFrame
y = df['AHD']

print(y)

# Change Categorical label to numeric label
# Replace No with 0 and Yes with 1
# y = y.replace('No',0)
# y = y.replace('Yes',1)
# print(y)

0       No
1      Yes
2      Yes
3       No
4       No
      ... 
296    Yes
297    Yes
298    Yes
299    Yes
300     No
Name: AHD, Length: 301, dtype: object


## D
Split the dataset into testing and training sets with the following parameters: test_size=0.25, random_state=6.
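
As a quick sanity check on the split sizes, the arithmetic can be sketched in plain Python. This assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remaining rows to the training set, which is consistent with the 225/76 split printed in this part. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(301, 0.25)
print(n_train, n_test)  # 225 76
```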

In [6]:
# Randomly splitting the original dataset into training set and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=6)


In [7]:
# print the training set:
print(X_train)
print('\n')
print(y_train)

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak
270   46     140   311        0    120      1.8
69    46     150   231        0    147      3.6
289   55     132   342        0    166      1.2
294   59     164   176        2     90      1.0
143   58     105   240        2    154      0.6
..   ...     ...   ...      ...    ...      ...
1     67     160   286        2    108      1.5
281   35     122   192        0    174      0.0
106   57     128   229        2    150      0.4
227   54     110   206        2    108      0.0
201   57     150   126        0    173      0.2

[225 rows x 6 columns]


270    Yes
69     Yes
289     No
294    Yes
143     No
      ... 
1      Yes
281     No
106    Yes
227    Yes
201     No
Name: AHD, Length: 225, dtype: object


In [8]:
# print the testing set:
print(X_test)
print('\n')
print(y_test)

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak
27    66     150   226        0    114      2.6
136   62     120   281        2    103      1.4
213   52     112   230        0    160      0.0
184   63     140   195        0    179      0.0
93    63     135   252        2    172      0.0
..   ...     ...   ...      ...    ...      ...
192   62     138   294        0    106      1.9
212   66     178   228        0    165      1.0
112   43     132   341        2    136      3.0
100   34     118   182        2    174      0.0
13    44     120   263        0    173      0.0

[76 rows x 6 columns]


27      No
136    Yes
213    Yes
184     No
93      No
      ... 
192    Yes
212    Yes
112    Yes
100     No
13      No
Name: AHD, Length: 76, dtype: object


## E
Use KNN (with k=3), Decision Tree (with random_state=5), and Logistic Regression Classifiers to predict Heart Disease based on the training/testing datasets that you built in part (d). Then check, compare, and report the accuracy of these 3 classifiers. Which one is the best? Which one is the worst?
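
For reference, the accuracy reported in this part is the fraction of test rows whose predicted label equals the true label. The sketch below shows that computation in plain Python. `accuracy` is a hypothetical helper standing in for `accuracy_score`, and the toy labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels (illustrative only, not taken from the dataset)
toy_true = ["Yes", "No", "No", "Yes"]
toy_pred = ["Yes", "No", "Yes", "Yes"]
print(accuracy(toy_true, toy_pred))  # 0.75
```

With 76 test rows, each correct prediction contributes 1/76 (about 0.013) to the score, so an accuracy of 0.6447 corresponds to 49 correct predictions out of 76.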

In [9]:
# Using KNN

# Instantiating object of KNeighborsClassifier "class" with k=3:
k = 3
my_knn = KNeighborsClassifier(n_neighbors=k)

# Training the model with train dataset we created using split
my_knn.fit(X_train, y_train)

# Testing the model with test dataset we created using split

# store the predicted values
my_knn_y_predict = my_knn.predict(X_test)

# Calculating the Accuracy of the prediction 

# we use accuracy_score function
my_knn_accuracy = accuracy_score(y_test, my_knn_y_predict)

print('Accuracy using KNN classifier with k = 3 is: ',my_knn_accuracy)

Accuracy using KNN classifier with k = 3 is:  0.6447368421052632


In [10]:
# Using Decision Tree

# Instantiating object of DecisionTreeClassifier "class" with random_state=5:
my_decisiontree = DecisionTreeClassifier(random_state=5)

# Training the model with train dataset we created using split
my_decisiontree.fit(X_train, y_train)

# Testing the model with test dataset we created using split

# store the predicted values
my_decisiontree_y_predict = my_decisiontree.predict(X_test)

# Calculating the Accuracy of the prediction 

# we use accuracy_score function
my_decisiontree_accuracy = accuracy_score(y_test, my_decisiontree_y_predict)

print('Accuracy using Decision Tree is: ',my_decisiontree_accuracy)

Accuracy using Decision Tree is:  0.618421052631579


In [11]:
# Using Logistic Regression

# Instantiating object of LogisticRegression "class"
my_logisticregression = LogisticRegression(max_iter=1000)

# Training the model with train dataset we created using split
my_logisticregression.fit(X_train, y_train)

# Testing the model with test dataset we created using split

# store the predicted values
my_logisticregression_y_predict = my_logisticregression.predict(X_test)

# Calculating the Accuracy of the prediction 

# we use accuracy_score function
my_logisticregression_accuracy = accuracy_score(y_test, my_logisticregression_y_predict)

print('Accuracy using Logistic Regression is: ',my_logisticregression_accuracy)

Accuracy using Logistic Regression is:  0.6710526315789473


In [12]:
print("Best Algorithm is Logistic Regression with Accuracy Score: ", my_logisticregression_accuracy)
print("Worst Algorithm is Decision Tree with Accuracy Score: ", my_decisiontree_accuracy)

Best Algorithm is Logistic Regression with Accuracy Score:  0.6710526315789473
Worst Algorithm is Decision Tree with Accuracy Score:  0.618421052631579


## F
Now, we want to use the categorical features as well! To this end, we have to perform a feature engineering process called one-hot encoding for the categorical features. Each categorical feature is replaced with dummy columns in the feature table (one column for each possible value of the feature) and encoded in a binary manner, such that exactly one of the dummy columns is “1” in each row and the rest are “0”. For example, “Gender” can take two values, “m” and “f”, so we replace this feature (in the feature table) with 2 columns titled “m” and “f”. For a male subject, we put “1” and “0” in the columns “m” and “f”, respectively; for a female subject, we put “0” and “1”. (Hint: you will need 4 columns to encode “ChestPain” and 3 columns to encode “Thal”.)
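
To make the encoding concrete, here is a minimal sketch in plain Python (no pandas) that builds the dummy columns by hand. `one_hot` is a hypothetical helper. Sorting the categories alphabetically is an assumption, chosen to mirror the column order that appears in the notebook output.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# First five ChestPain values from the printed dataset
chest_pain = ["typical", "asymptomatic", "asymptomatic", "nonanginal", "nontypical"]
categories, rows = one_hot(chest_pain)

print(categories)
for row in rows:
    print(row)  # exactly one 1 per row
```

In the notebook itself, `pd.get_dummies` does the same job on a whole column in one call.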

In [13]:
# Create a New feature Matrix

# create a python list of numerical feature names
feature_cols = ['Age','RestBP','Chol','RestECG','MaxHR','Oldpeak']

# use the above list to select the features
X = df[feature_cols]

# Generate Dummy values for Gender
dummy_gender = pd.get_dummies(df['Gender'])

print(dummy_gender.head())

# Concatenate the Dummy values of Gender with feature matrix
X = pd.concat([X, dummy_gender], axis=1)
print(X.head())

   f  m
0  1  0
1  1  0
2  1  0
3  1  0
4  0  1
   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  f  m
0   63     145   233        2    150      2.3  1  0
1   67     160   286        2    108      1.5  1  0
2   67     120   229        2    129      2.6  1  0
3   37     130   250        0    187      3.5  1  0
4   41     130   204        2    172      1.4  0  1


In [14]:
# Generate Dummy values for ChestPain
dummy_chestpain = pd.get_dummies(df['ChestPain'])

print(dummy_chestpain.head())
# Concatenate the Dummy values of ChestPain with feature matrix
X = pd.concat([X, dummy_chestpain], axis=1)
print(X.head())

   asymptomatic  nonanginal  nontypical  typical
0             0           0           0        1
1             1           0           0        0
2             1           0           0        0
3             0           1           0        0
4             0           0           1        0
   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  f  m  asymptomatic  nonanginal  \
0   63     145   233        2    150      2.3  1  0             0           0   
1   67     160   286        2    108      1.5  1  0             1           0   
2   67     120   229        2    129      2.6  1  0             1           0   
3   37     130   250        0    187      3.5  1  0             0           1   
4   41     130   204        2    172      1.4  0  1             0           0   

   nontypical  typical  
0           0        1  
1           0        0  
2           0        0  
3           0        0  
4           1        0  


In [15]:
# Generate Dummy values for Thal
dummy_thal = pd.get_dummies(df['Thal'])

print(dummy_thal.head())
# Concatenate the Dummy values of Thal with feature matrix
X = pd.concat([X, dummy_thal], axis=1)
print(X.head())

   fixed  normal  reversable
0      1       0           0
1      0       1           0
2      0       0           1
3      0       1           0
4      0       1           0
   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  f  m  asymptomatic  nonanginal  \
0   63     145   233        2    150      2.3  1  0             0           0   
1   67     160   286        2    108      1.5  1  0             1           0   
2   67     120   229        2    129      2.6  1  0             1           0   
3   37     130   250        0    187      3.5  1  0             0           1   
4   41     130   204        2    172      1.4  0  1             0           0   

   nontypical  typical  fixed  normal  reversable  
0           0        1      1       0           0  
1           0        0      0       1           0  
2           0        0      0       0           1  
3          

In [16]:
# New Feature Matrix
print(X)

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  f  m  asymptomatic  \
0     63     145   233        2    150      2.3  1  0             0   
1     67     160   286        2    108      1.5  1  0             1   
2     67     120   229        2    129      2.6  1  0             1   
3     37     130   250        0    187      3.5  1  0             0   
4     41     130   204        2    172      1.4  0  1             0   
..   ...     ...   ...      ...    ...      ... .. ..           ...   
296   45     110   264        0    132      1.2  1  0             0   
297   68     144   193        0    141      3.4  1  0             1   
298   57     130   131        0    115      1.2  1  0             1   
299   57     130   236        2    174      0.0  0  1             0   
300   38     138   175        0    173      0.0  1  0             0   

     nonanginal  nontypical  typical  fixed  normal  reversable  
0             0           0        1      1       0           0  
1             0

## G
Repeat parts (d) and (e) with the new dataset that you built in part (f). How does the prediction accuracy change for each method?

In [17]:
# Randomly splitting the new dataset (with one-hot encoded features) into training set and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=6)


In [18]:
# print the training set:
print(X_train)
print('\n')
print(y_train)

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  f  m  asymptomatic  \
270   46     140   311        0    120      1.8  1  0             1   
69    46     150   231        0    147      3.6  1  0             0   
289   55     132   342        0    166      1.2  0  1             0   
294   59     164   176        2     90      1.0  1  0             1   
143   58     105   240        2    154      0.6  1  0             0   
..   ...     ...   ...      ...    ...      ... .. ..           ...   
1     67     160   286        2    108      1.5  1  0             1   
281   35     122   192        0    174      0.0  1  0             0   
106   57     128   229        2    150      0.4  1  0             0   
227   54     110   206        2    108      0.0  1  0             1   
201   57     150   126        0    173      0.2  1  0             0   

     nonanginal  nontypical  typical  fixed  normal  reversable  
270           0           0        0      0       0           1  
69            1

In [19]:
# print the testing set:
print(X_test)
print('\n')
print(y_test)

     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  f  m  asymptomatic  \
27    66     150   226        0    114      2.6  0  1             0   
136   62     120   281        2    103      1.4  1  0             0   
213   52     112   230        0    160      0.0  1  0             1   
184   63     140   195        0    179      0.0  0  1             0   
93    63     135   252        2    172      0.0  0  1             0   
..   ...     ...   ...      ...    ...      ... .. ..           ...   
192   62     138   294        0    106      1.9  0  1             1   
212   66     178   228        0    165      1.0  0  1             1   
112   43     132   341        2    136      3.0  0  1             1   
100   34     118   182        2    174      0.0  1  0             0   
13    44     120   263        0    173      0.0  1  0             0   

     nonanginal  nontypical  typical  fixed  normal  reversable  
27            0           0        1      0       1           0  
136           0

In [20]:
# Using KNN

# Instantiating object of KNeighborsClassifier "class" with k=3:
k = 3
my_knn1 = KNeighborsClassifier(n_neighbors=k)

# Training the model with train dataset we created using split
my_knn1.fit(X_train, y_train)

# Testing the model with test dataset we created using split

# store the predicted values
my_knn1_y_predict = my_knn1.predict(X_test)

# Calculating the Accuracy of the prediction 

# we use accuracy_score function
my_knn1_accuracy = accuracy_score(y_test, my_knn1_y_predict)

print('Accuracy using KNN classifier with k = 3 is: ',my_knn1_accuracy)

Accuracy using KNN classifier with k = 3 is:  0.6447368421052632


In [21]:
# Using Decision Tree

# Instantiating object of DecisionTreeClassifier "class" with random_state=5:
my_decisiontree1 = DecisionTreeClassifier(random_state=5)

# Training the model with train dataset we created using split
my_decisiontree1.fit(X_train, y_train)

# Testing the model with test dataset we created using split

# store the predicted values
my_decisiontree1_y_predict = my_decisiontree1.predict(X_test)

# Calculating the Accuracy of the prediction 

# we use accuracy_score function
my_decisiontree1_accuracy = accuracy_score(y_test, my_decisiontree1_y_predict)

print('Accuracy using Decision Tree is: ',my_decisiontree1_accuracy)

Accuracy using Decision Tree is:  0.7368421052631579


In [22]:
# Using Logistic Regression

# Instantiating object of LogisticRegression "class"
my_logisticregression1 = LogisticRegression(max_iter=1000)

# Training the model with train dataset we created using split
my_logisticregression1.fit(X_train, y_train)

# Testing the model with test dataset we created using split

# store the predicted values
my_logisticregression1_y_predict = my_logisticregression1.predict(X_test)

# Calculating the Accuracy of the prediction 

# we use accuracy_score function
my_logisticregression1_accuracy = accuracy_score(y_test, my_logisticregression1_y_predict)

print('Accuracy using Logistic Regression is: ',my_logisticregression1_accuracy)

Accuracy using Logistic Regression is:  0.7763157894736842


In [23]:
print("Best Algorithm is Logistic Regression with Accuracy Score: ", my_logisticregression1_accuracy)
print("Worst Algorithm is KNN Classifier with Accuracy Score: ", my_knn1_accuracy)

Best Algorithm is Logistic Regression with Accuracy Score:  0.7763157894736842
Worst Algorithm is KNN Classifier with Accuracy Score:  0.6447368421052632
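
To answer how the accuracy changes for each method, the scores printed in parts (e) and (g) can be compared directly. The snippet below is a small sketch that hard-codes those printed values (rather than reusing the notebook variables) so that it runs on its own. The dictionary names are hypothetical.

```python
# Test accuracies copied from the printed outputs of parts (e) and (g)
numeric_only = {
    "KNN (k=3)": 0.6447368421052632,
    "Decision Tree": 0.618421052631579,
    "Logistic Regression": 0.6710526315789473,
}
with_one_hot = {
    "KNN (k=3)": 0.6447368421052632,
    "Decision Tree": 0.7368421052631579,
    "Logistic Regression": 0.7763157894736842,
}

# Change in accuracy after adding the one-hot encoded features
changes = {name: with_one_hot[name] - numeric_only[name] for name in numeric_only}

for name, delta in changes.items():
    print(f"{name}: {numeric_only[name]:.3f} -> {with_one_hot[name]:.3f} ({delta:+.3f})")
```

On this particular split, KNN is unchanged, while Decision Tree and Logistic Regression improve by roughly 0.12 and 0.11 respectively.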


## H
Now, repeat part (e) with the new dataset that you built in part (f), but this time using Cross-Validation. Thus, rather than splitting the dataset into testing and training, use 10-fold Cross-Validation (as we learned in Lab4) to evaluate the classification methods and report the final prediction accuracy. 
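
As a rough picture of what 10-fold cross-validation does with the 301 rows, the sketch below partitions the row indices into 10 contiguous folds, each used once as the test set. It assumes plain, unshuffled folds. To my knowledge `cross_val_score` uses stratified folds for classifiers, which also try to preserve the class ratio, so the actual fold membership may differ. `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_splits):
    """Yield-style list of (train_idx, test_idx) pairs for contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = kfold_indices(301, 10)
print([len(test_idx) for _, test_idx in folds])
```

Each row lands in exactly one test fold, and the final score is the mean of the per-fold accuracies.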

 

In [24]:
# Applying 10-fold cross validation with "KNN Classifier":

# Instantiating object of KNeighborsClassifier "class" with k=3:
k = 3
my_knn2 = KNeighborsClassifier(n_neighbors=k)

knn2_accuracy_list = cross_val_score(my_knn2, X, y, cv=10, scoring='accuracy')

print(knn2_accuracy_list)

knn2_accuracy_mean = knn2_accuracy_list.mean()

print('Accuracy using KNN Classifier with 10-fold cross validation is: ', knn2_accuracy_mean)

[0.70967742 0.63333333 0.56666667 0.66666667 0.6        0.5
 0.66666667 0.7        0.56666667 0.73333333]
Accuracy using KNN Classifier with 10-fold cross validation is:  0.6343010752688172


In [25]:
# Applying 10-fold cross validation with "Decision Tree":

# Instantiating object of DecisionTreeClassifier "class" with random_state=5:
my_decisiontree2 = DecisionTreeClassifier(random_state=5)

decisiontree2_accuracy_list = cross_val_score(my_decisiontree2, X, y, cv=10, scoring='accuracy')

print(decisiontree2_accuracy_list)

decisiontree2_accuracy_mean = decisiontree2_accuracy_list.mean()

print(decisiontree2_accuracy_mean)

[0.77419355 0.76666667 0.8        0.76666667 0.8        0.7
 0.7        0.66666667 0.7        0.83333333]
0.750752688172043


In [26]:
# Applying 10-fold cross validation with "Logistic Regression":

# Instantiating object of LogisticRegression "class"
my_logisticregression2 = LogisticRegression(max_iter=10000)

logisticregression2_accuracy_list = cross_val_score(my_logisticregression2, X, y, cv=10, scoring='accuracy')

print(logisticregression2_accuracy_list)

logisticregression2_accuracy_mean = logisticregression2_accuracy_list.mean()

print(logisticregression2_accuracy_mean)

[0.77419355 0.8        0.83333333 0.86666667 0.93333333 0.7
 0.76666667 0.8        0.8        0.83333333]
0.810752688172043


In [27]:
print("Best Algorithm is Logistic Regression with Average Accuracy Score: ", logisticregression2_accuracy_mean)
print("Worst Algorithm is KNN Classifier with Average Accuracy Score: ", knn2_accuracy_mean)

Best Algorithm is Logistic Regression with Average Accuracy Score:  0.810752688172043
Worst Algorithm is KNN Classifier with Average Accuracy Score:  0.6343010752688172
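
Because the mean hides how much the folds disagree, it can help to look at the spread of the per-fold scores as well. The sketch below hard-codes the fold accuracies printed in this part (rounded as shown there) and summarizes them with the standard library's `statistics` module. The variable names are hypothetical.

```python
import statistics

# Per-fold accuracies copied from the printed outputs of part (h)
fold_scores = {
    "KNN (k=3)": [0.70967742, 0.63333333, 0.56666667, 0.66666667, 0.6,
                  0.5, 0.66666667, 0.7, 0.56666667, 0.73333333],
    "Decision Tree": [0.77419355, 0.76666667, 0.8, 0.76666667, 0.8,
                      0.7, 0.7, 0.66666667, 0.7, 0.83333333],
    "Logistic Regression": [0.77419355, 0.8, 0.83333333, 0.86666667, 0.93333333,
                            0.7, 0.76666667, 0.8, 0.8, 0.83333333],
}

# Mean and sample standard deviation of the 10 fold scores per method
summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in fold_scores.items()
}

for name, (mean, spread) in summary.items():
    print(f"{name}: mean={mean:.4f}, std={spread:.4f}")
```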


## Answer Summary

### E
#### Best Algorithm is Logistic Regression with Accuracy Score:  0.6710526315789473
#### Worst Algorithm is Decision Tree with Accuracy Score:  0.618421052631579

### G
#### Best Algorithm is Logistic Regression with Accuracy Score:  0.7763157894736842
#### Worst Algorithm is KNN Classifier with Accuracy Score:  0.6447368421052632

### H
#### Best Algorithm is Logistic Regression with Average Accuracy Score:  0.810752688172043
#### Worst Algorithm is KNN Classifier with Average Accuracy Score:  0.6343010752688172

