## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will use the heart disease dataset loaded below and analyze the models fitted to it.

In [28]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [3]:
heart.shape

(4238, 16)

In [4]:
heart.head()

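|  | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |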


This dataset lets us predict the risk of coronary heart disease (CHD) within the next 10 years, given the risk factors recorded for each subject in the study. Our target variable is `TenYearCHD`.

We'll start off by removing any rows containing missing data.

In [5]:
# answer below:
heart.dropna(inplace=True)
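
Before dropping rows, it can help to check how much data is affected; here the frame shrinks from 4238 rows to 3656. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the heart data (values are made up).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "education": [4.0, 2.0, np.nan, 3.0],
    "TenYearCHD": [0, 0, 0, 1],
})

# Count missing values per column before deciding to drop rows.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() without inplace returns a new frame and leaves the original intact.
clean = toy.dropna()
rows_dropped = len(toy) - len(clean)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

The same two calls, `heart.isna().sum()` and `len(heart)` before and after, would show which columns drive the row loss in the real data.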

In [8]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3656 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             3656 non-null   int64  
 1   age              3656 non-null   int64  
 2   education        3656 non-null   float64
 3   currentSmoker    3656 non-null   int64  
 4   cigsPerDay       3656 non-null   float64
 5   BPMeds           3656 non-null   float64
 6   prevalentStroke  3656 non-null   int64  
 7   prevalentHyp     3656 non-null   int64  
 8   diabetes         3656 non-null   int64  
 9   totChol          3656 non-null   float64
 10  sysBP            3656 non-null   float64
 11  diaBP            3656 non-null   float64
 12  BMI              3656 non-null   float64
 13  heartRate        3656 non-null   float64
 14  glucose          3656 non-null   float64
 15  TenYearCHD       3656 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 485.6 KB


Then, we split the data into train and test with 20% of the data in the test subset.

In [9]:
# answer below:
X = heart.drop('TenYearCHD', axis=1)
Y = heart.TenYearCHD

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
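
Because the split above is random and unseeded, the numbers reported later will vary from run to run, and the share of positives in the test set can drift. A possible variant, sketched here on synthetic data (`X_demo` and `y_demo` are stand-ins, and the `random_state` values are arbitrary), fixes the seed and stratifies on the target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 85% negatives and 15% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=15, weights=[0.85, 0.15], random_state=0
)

# stratify keeps the class proportions (almost) identical in both subsets;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: overall={overall_rate:.3f}, "
      f"train={train_rate:.3f}, test={test_rate:.3f}")
```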

We will then scale the data using the standard scaler. Do this in the cell below.

In [12]:
# answer below:
scaler = StandardScaler()

X_train_ = scaler.fit_transform(X_train)
X_test_ = scaler.transform(X_test)
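
Fitting the scaler on the training data only, as above, keeps test-set statistics out of the model. One way to make that hard to get wrong is to bundle the scaler and the classifier in a pipeline. This is a sketch on synthetic data (`X_demo`, `y_demo`, and `model` are illustrative names), not a replacement for the cell above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() learns the scaling parameters from the training data only;
# score() and predict() reuse them on whatever data they receive.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=7, gamma="auto"))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```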

Generate a polynomial SVC model and an RBF SVC model. Compare the performance and the runtime of the two models.

In [38]:
# answer below:
start_time = time.time()
poly = SVC(kernel='poly', C=7, gamma='auto')
poly.fit(X_train_, y_train)
print(f'Polynomial Runtime: {time.time() - start_time}')

poly_pred = poly.predict(X_test_)
print(f'Polynomial Accuracy: {accuracy_score(y_test, poly_pred)}')

Polynomial Runtime: 0.7368285655975342
Polynomial Accuracy: 0.8415300546448088


In [34]:
start_time = time.time()
rbf = SVC(kernel='rbf', C=7, gamma='auto')
rbf.fit(X_train_, y_train)
print(f'RBF Runtime: {time.time() - start_time}')

rbf_pred = rbf.predict(X_test_)
print(f'RBF Accuracy: {accuracy_score(y_test, rbf_pred)}')

RBF Runtime: 0.39435744285583496
RBF Accuracy: 0.8387978142076503
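
A single timing of well under a second is noisy, so the runtime comparison is more trustworthy when repeated. The helper below is a hypothetical sketch (`time_fit` is not part of scikit-learn) that uses `time.perf_counter` and keeps the best of several runs, shown on synthetic data:

```python
import time

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, already-scaled stand-in for X_train_ / y_train.
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)


def time_fit(model, X, y, repeats=3):
    """Return the best wall-clock fit time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        model.fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best


timings = {
    kernel: time_fit(SVC(kernel=kernel, C=7, gamma="auto"), X_demo, y_demo)
    for kernel in ("poly", "rbf")
}
for kernel, seconds in timings.items():
    print(f"{kernel}: {seconds:.4f} s")
```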


Which model overfits more? How would you reduce the overfitting?

Look at a classification report and confusion matrix. How does the class balance affect your results?

In [39]:
# answer below:
print(f'Polynomial Confusion Matrix: \n{confusion_matrix(y_test, poly_pred)}')
print(f'Polynomial Report: \n{classification_report(y_test, poly_pred)}')
print(f'\nRBF Confusion Matrix: \n{confusion_matrix(y_test, rbf_pred)}')
print(f'RBF Report: \n{classification_report(y_test, rbf_pred)}')

Polynomial Confusion Matrix: 
[[600  14]
 [102  16]]
Polynomial Report: 
              precision    recall  f1-score   support

           0       0.85      0.98      0.91       614
           1       0.53      0.14      0.22       118

    accuracy                           0.84       732
   macro avg       0.69      0.56      0.56       732
weighted avg       0.80      0.84      0.80       732


RBF Confusion Matrix: 
[[601  13]
 [105  13]]
RBF Report: 
              precision    recall  f1-score   support

           0       0.85      0.98      0.91       614
           1       0.50      0.11      0.18       118

    accuracy                           0.84       732
   macro avg       0.68      0.54      0.55       732
weighted avg       0.79      0.84      0.79       732
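
As a cross-check on the reports above, the headline numbers can be recomputed directly from the two confusion matrices, together with the accuracy a classifier would get by always predicting the majority class. The helper name `summarize` is made up for this sketch:

```python
# Confusion matrices copied from the output above: rows are true classes,
# columns are predicted classes, both in the order [0, 1].
matrices = {
    "poly": [[600, 14], [102, 16]],
    "rbf": [[601, 13], [105, 13]],
}


def summarize(cm):
    """Compute accuracy, majority-class baseline, and class-1 precision/recall."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy of always predicting the more frequent true class.
        "baseline": max(tn + fp, fn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }


summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

On this test split the majority-class baseline is 614/732, so the accuracy figures alone say little about how well either model finds the positive cases.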



In [35]:
heart.TenYearCHD.value_counts()

0    3099
1     557
Name: TenYearCHD, dtype: int64

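*The SVC with the RBF kernel trains faster (about 0.39 s versus 0.74 s for the polynomial kernel), and the two test accuracies are nearly identical (0.839 versus 0.842). The polynomial kernel appears to overfit more, although confirming this requires comparing training accuracy with test accuracy for each model. Decreasing the C value (stronger regularization) should reduce overfitting for both models, though probably not by much; increasing C does the opposite, because it penalizes training errors more heavily.*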
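
One way to check which model overfits more is to compare training and test accuracy for each kernel at several values of C; a large positive gap suggests overfitting, and smaller C means stronger regularization. The sketch below runs on synthetic data (`X_demo`, `y_demo`, and the C grid are assumptions chosen for illustration), so its numbers say nothing about the heart dataset itself:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Noisy synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scale using statistics from the training split only.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

gaps = {}
for kernel in ("poly", "rbf"):
    for C in (0.1, 1, 7):
        clf = SVC(kernel=kernel, C=C, gamma="auto").fit(X_tr_s, y_tr)
        train_acc = clf.score(X_tr_s, y_tr)
        test_acc = clf.score(X_te_s, y_te)
        # Train-minus-test accuracy: larger values suggest more overfitting.
        gaps[(kernel, C)] = train_acc - test_acc
        print(f"{kernel:>4} C={C:<4} train={train_acc:.3f} "
              f"test={test_acc:.3f} gap={gaps[(kernel, C)]:+.3f}")
```

Running the same loop with `X_train_`, `X_test_`, `y_train`, and `y_test` from this notebook would answer the question for the actual models.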

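*The classes are imbalanced: negatives outnumber positives 3099 to 557 (about 85% to 15%). Both models therefore predict the negative class almost all the time, which yields high accuracy (0.84, roughly the share of negatives in the test set) but very low recall for the positive class (0.14 for the polynomial kernel and 0.11 for RBF) and a precision of only about 0.5.*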
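
A common response to this kind of imbalance is to weight the classes inversely to their frequency via `SVC`'s `class_weight='balanced'` option, which typically raises recall on the minority class at some cost in precision and overall accuracy. The following is a sketch on synthetic data with a similar 85/15 split (`X_demo` and `y_demo` are stand-ins), so the size of the effect on the heart data is not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with roughly 85% negatives and 15% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=15, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scale using statistics from the training split only.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

recalls = {}
for label, class_weight in (("unweighted", None), ("balanced", "balanced")):
    clf = SVC(kernel="rbf", C=7, gamma="auto", class_weight=class_weight)
    clf.fit(X_tr_s, y_tr)
    # Recall on the positive (minority) class.
    recalls[label] = recall_score(y_te, clf.predict(X_te_s))
    print(f"{label}: positive-class recall = {recalls[label]:.3f}")
```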