## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will fit them to the heart disease dataset loaded below and analyze the resulting models.

In [10]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import time
from sklearn.metrics import classification_report, confusion_matrix


In [4]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [None]:
heart.shape

(4238, 16)

In [5]:
heart.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


This dataset helps us predict the risk of coronary heart disease (CHD) in the next 10 years, given the risk factors recorded for each subject in the study. Our target variable is `TenYearCHD`.

We'll start off by removing any rows containing missing data.

In [6]:
# answer below:

heart.dropna(inplace=True)
heart.isnull().sum()

male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64

Then, we split the data into train and test with 20% of the data in the test subset.

In [8]:
# answer below:

X = heart.drop('TenYearCHD', axis=1)
y = heart.TenYearCHD


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
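The split above is random and unstratified, so the share of positive cases can differ between the train and test subsets from run to run. As a minimal sketch, assuming a synthetic stand-in for the heart data (the names `X_demo` and `y_demo` and the 15% positive rate are illustrative, not taken from the dataset), `stratify` and `random_state` make the split repeatable and keep the class ratio the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data: 1000 rows, 15% positives (assumed values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify keeps the class ratio the same in both subsets;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f'train positive rate: {y_tr.mean():.3f}')
print(f'test positive rate:  {y_te.mean():.3f}')
```

The outputs shown in this notebook come from the unstratified split, so rerunning with these options would give slightly different numbers.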



We will then scale the data using the standard scaler. Do this in the cell below.

In [11]:
# answer below:

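scale = StandardScaler()
# Fit the scaler on the training data only (StandardScaler ignores y),
# then apply the same transformation to the test data.
X_train_scale = scale.fit_transform(X_train)
X_test_scale = scale.transform(X_test)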
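The same fit-on-train, transform-on-test pattern can be bundled into a single estimator. The following is a sketch only, using synthetic data from `make_classification` in place of the heart data (the names `X_demo`, `y_demo`, and `model` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the heart dataset (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training data only and reuses
# those statistics when scoring or predicting on new data.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
model.fit(X_tr, y_tr)

test_score = model.score(X_te, y_te)
print(f'pipeline test score: {test_score:.3f}')
```

A pipeline like this is convenient with cross-validation, because the scaler is refit inside each fold instead of seeing the held-out rows.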

Generate a polynomial SVC model and an RBF SVC model. Compare the performance and the training time of the two models.

In [12]:
# answer below:

poly_svc = SVC(kernel='poly', C=10, degree=3)
rbf_svc = SVC(kernel='rbf')

start_time = time.time()
poly_svc.fit(X_train_scale, y_train)
print(f'polynomial SVC train time: {time.time()-start_time} seconds\n')

print(
    f'polynomial training score: {poly_svc.score(X_train_scale, y_train)}\n'
    f'polynomial test score: {poly_svc.score(X_test_scale, y_test)}\n'
)

polynomial SVC train time: 0.8462419509887695 seconds

polynomial training score: 0.884404924760602
polynomial test score: 0.8497267759562842



In [13]:
start_time = time.time()
rbf_svc.fit(X_train_scale, y_train)
print(f'rbf SVC train time: {time.time()-start_time} seconds\n')

print(
    f'rbf training score: {rbf_svc.score(X_train_scale, y_train)}\n'
    f'rbf test score: {rbf_svc.score(X_test_scale, y_test)}\n'
)

rbf SVC train time: 0.3139011859893799 seconds

rbf training score: 0.8594391244870041
rbf test score: 0.8592896174863388



Which model overfits more? How would you reduce the overfitting?
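One common way to reduce overfitting in an SVC is to lower `C`, which strengthens regularization, and to choose the value by cross-validation. The sketch below assumes synthetic data and an arbitrary grid of `C` values; neither comes from this assignment, and the best value for the heart data may differ:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the heart dataset (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scale inside the pipeline so each cross-validation fold is scaled separately.
pipe = make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3))

# Smaller C means stronger regularization; the grid values are illustrative.
param_grid = {'svc__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'best cross-validated accuracy: {search.best_score_:.3f}')
```

The same grid could also include `svc__degree` or, for the RBF kernel, `svc__gamma`.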

Look at a classification report and confusion matrix. How does the class balance affect your results?

In [14]:
# answer below:

poly_train_pred = poly_svc.predict(X_train_scale)
poly_test_pred = poly_svc.predict(X_test_scale)

rbf_train_pred = rbf_svc.predict(X_train_scale)
rbf_test_pred = rbf_svc.predict(X_test_scale)

In [15]:
print(
    f'--------------- Polynomial SVC --------------\n'
    f'Train\n'
    f'{classification_report(y_train, poly_train_pred)}\n'
    f'confusion matrix\n'
    f'{confusion_matrix(y_train, poly_train_pred)}\n'
    f'\nTest\n'
    f'{classification_report(y_test, poly_test_pred)}\n'
    f'confusion matrix\n'
    f'{confusion_matrix(y_test, poly_test_pred)}\n\n'
)

--------------- Polynomial SVC --------------
Train
              precision    recall  f1-score   support

           0       0.88      1.00      0.94      2469
           1       0.98      0.26      0.42       455

    accuracy                           0.88      2924
   macro avg       0.93      0.63      0.68      2924
weighted avg       0.90      0.88      0.85      2924

confusion matrix
[[2466    3]
 [ 335  120]]

Test
              precision    recall  f1-score   support

           0       0.87      0.98      0.92       630
           1       0.30      0.06      0.10       102

    accuracy                           0.85       732
   macro avg       0.58      0.52      0.51       732
weighted avg       0.79      0.85      0.80       732

confusion matrix
[[616  14]
 [ 96   6]]
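To connect the confusion matrix to the classification report, the class 1 test metrics for the polynomial SVC can be recomputed by hand from the four counts printed above. This is a worked check rather than part of the original solution; the variable names are illustrative:

```python
# Test-set confusion matrix for the polynomial SVC, copied from the output above.
# Rows are true classes, columns are predicted classes.
tn, fp = 616, 14
fn, tp = 96, 6

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct

print(f'precision: {precision:.2f}')  # 0.30
print(f'recall:    {recall:.2f}')     # 0.06
print(f'accuracy:  {accuracy:.2f}')   # 0.85
```

Accuracy stays high because the 616 correctly classified negatives dominate the total, even though only 6 of the 102 positive cases are found.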




In [16]:
print(
    f'--------------- rbf SVC --------------\n'
    f'Train\n'
    f'{classification_report(y_train, rbf_train_pred)}\n'
    f'confusion matrix\n'
    f'{confusion_matrix(y_train, rbf_train_pred)}\n'
    f'\nTest\n'
    f'{classification_report(y_test, rbf_test_pred)}\n'
    f'confusion matrix\n'
    f'{confusion_matrix(y_test, rbf_test_pred)}'
)

--------------- rbf SVC --------------
Train
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      2469
           1       0.98      0.10      0.18       455

    accuracy                           0.86      2924
   macro avg       0.92      0.55      0.55      2924
weighted avg       0.88      0.86      0.81      2924

confusion matrix
[[2468    1]
 [ 410   45]]

Test
              precision    recall  f1-score   support

           0       0.86      1.00      0.92       630
           1       0.00      0.00      0.00       102

    accuracy                           0.86       732
   macro avg       0.43      0.50      0.46       732
weighted avg       0.74      0.86      0.80       732

confusion matrix
[[629   1]
 [102   0]]


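Both models are poor at predicting the positive class because it has few observations: only 102 of the 732 test rows are positive. On the test set, the polynomial SVC finds just 6 of those 102 cases and the RBF SVC finds none, so the high accuracy of both models mostly reflects the majority class.

The polynomial model overfits more (0.88 training accuracy versus 0.85 test accuracy), but it is still the better of the two here, because it is the only one that identifies any positive cases in the test set.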
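Given how few positive cases the models recover, one option worth trying is `class_weight='balanced'`, which makes errors on the rarer class cost more during training. The sketch below uses synthetic imbalanced data (about 15% positives, an assumed figure) rather than the heart data, so the printed numbers say nothing about this dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data: roughly 85% negatives, 15% positives (assumed).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Same kernel, with and without class weighting.
plain = SVC(kernel='rbf').fit(X_tr, y_tr)
balanced = SVC(kernel='rbf', class_weight='balanced').fit(X_tr, y_tr)

plain_recall = recall_score(y_te, plain.predict(X_te))
balanced_recall = recall_score(y_te, balanced.predict(X_te))

print(f'recall for class 1, default weights:  {plain_recall:.2f}')
print(f'recall for class 1, balanced weights: {balanced_recall:.2f}')
```

Class weighting usually trades some precision and overall accuracy for higher recall on the rare class, so the classification report should be compared before and after.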