## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will fit them to the Framingham heart disease dataset loaded below and analyze the resulting models.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [3]:
heart.shape

(4238, 16)

In [4]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB


In [5]:
heart.head()

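| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |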


This dataset helps us predict the probability of coronary heart disease (CHD) in the next 10 years given the risk factors for each subject in the study. Our target variable is `TenYearCHD`.

We'll start off by removing any rows containing missing data.

In [6]:
# answer below:
heart_drop = heart.dropna(axis=0)
heart_drop.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3656 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             3656 non-null   int64  
 1   age              3656 non-null   int64  
 2   education        3656 non-null   float64
 3   currentSmoker    3656 non-null   int64  
 4   cigsPerDay       3656 non-null   float64
 5   BPMeds           3656 non-null   float64
 6   prevalentStroke  3656 non-null   int64  
 7   prevalentHyp     3656 non-null   int64  
 8   diabetes         3656 non-null   int64  
 9   totChol          3656 non-null   float64
 10  sysBP            3656 non-null   float64
 11  diaBP            3656 non-null   float64
 12  BMI              3656 non-null   float64
 13  heartRate        3656 non-null   float64
 14  glucose          3656 non-null   float64
 15  TenYearCHD       3656 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 485.6 KB
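Dropping incomplete rows reduced the data from 4238 to 3656 rows, so 582 rows (about 14%) were discarded, most of them because of missing `glucose` values. Before dropping, it can help to check where the gaps are and whether removing them shifts the class balance. The sketch below does this on a small made-up frame; `toy` and its values are illustrative stand-ins, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "BMI": [26.97, 28.73, np.nan, 28.58],
    "TenYearCHD": [0, 0, 0, 1],
})

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Rows removed by dropna, and the share of positives before and after.
toy_drop = toy.dropna(axis=0)
rows_removed = len(toy) - len(toy_drop)
positive_rate_before = toy["TenYearCHD"].mean()
positive_rate_after = toy_drop["TenYearCHD"].mean()

print(missing_per_column)
print("rows removed:", rows_removed)
print("positive rate before/after:", positive_rate_before, positive_rate_after)
```

The same three checks should work unchanged on `heart` and `heart_drop`.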


Next, we split the data into training and test sets, with 20% of the rows held out for testing.

In [7]:
# answer below:
from sklearn.model_selection import train_test_split

y = heart_drop.TenYearCHD
X = heart_drop.drop(columns=['TenYearCHD'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
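The split above is unseeded, so the scores reported later will vary from run to run. A possible variant, sketched here on synthetic data, fixes `random_state` for reproducibility and passes `stratify` so that both subsets keep the same share of positive cases. The arrays `X_demo` and `y_demo` and the seed values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 15% positives (roughly the imbalance seen here).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 30 + [0] * 170)

# stratify keeps the class ratio equal in both subsets;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

With the real data, this would correspond to adding `stratify=y` and a fixed `random_state` to the call above.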


We will then scale the features with `StandardScaler`, fitting it on the training set only and applying the same transformation to the test set. Do this in the cell below.

In [8]:
# answer below:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
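Fitting the scaler on the training set and only transforming the test set, as above, keeps test-set statistics out of the model. One way to make that pattern harder to get wrong is to bundle the scaler and the classifier into a single pipeline. The sketch below assumes synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() learns the scaling from X_tr only; score() reuses it on X_te.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

print("test accuracy:", test_accuracy)
```

A pipeline like this can also be passed directly to cross-validation utilities, which then refit the scaler inside each fold.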

Generate a polynomial SVC model and an RBF SVC model. Compare the performance and the runtime of the two models.

In [9]:
# answer below:
import timeit

from sklearn.svm import SVC

# Fit both kernels with scikit-learn's default hyperparameters.
svm_poly = SVC(kernel='poly')
svm_poly.fit(X_train_scaled, y_train)

svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train_scaled, y_train)

# Note: each timer below wraps only the two score() calls, so it measures
# prediction time on the train and test sets, not the time spent in fit().
start = timeit.default_timer()
print("Poly Training Score:", svm_poly.score(X_train_scaled, y_train))
print("Poly Testing Score:", svm_poly.score(X_test_scaled, y_test))
stop = timeit.default_timer()
print('Time: ', stop - start)

start = timeit.default_timer()
print("RBF Training Score:", svm_rbf.score(X_train_scaled, y_train))
print("RBF Testing Score:", svm_rbf.score(X_test_scaled, y_test))
stop = timeit.default_timer()
print('Time: ', stop - start)

Poly Training Score: 0.8734610123119015
Poly Testing Score: 0.8401639344262295
Time:  0.16114240000000102
RBF Training Score: 0.8614911080711354
RBF Testing Score: 0.8456284153005464
Time:  0.3141601000000005
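The times printed above cover the `score` calls, so they reflect prediction cost rather than training cost. To compare training runtime as well, the timer can wrap `fit` instead. The following is a minimal sketch on synthetic data; `time_fit` is a hypothetical helper written for this example, and the dataset size is arbitrary.

```python
import time

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, scaled stand-in for X_train_scaled / y_train.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)


def time_fit(kernel):
    """Return (fitted model, seconds spent in fit) for one kernel."""
    model = SVC(kernel=kernel)
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return model, time.perf_counter() - start


fit_times = {}
for kernel in ("poly", "rbf"):
    model, seconds = time_fit(kernel)
    fit_times[kernel] = seconds
    # The number of support vectors also drives prediction cost.
    print(f"{kernel}: fit in {seconds:.4f} s, "
          f"{model.n_support_.sum()} support vectors")
```

Single timings are noisy, so repeating each measurement a few times and comparing the medians would give a fairer picture.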


Which model overfits more? How would you reduce the overfitting?
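In the recorded run, the polynomial model shows the wider gap between training and test accuracy (0.873 vs. 0.840, compared with 0.861 vs. 0.846 for RBF). One common way to narrow such a gap is to tune the regularization strength `C` (smaller values regularize more) together with the kernel parameters, using cross-validation on the training data. The sketch below is one possible setup on synthetic data; the grid values are arbitrary assumptions, not recommended settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid: smaller C means stronger regularization.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

For the polynomial kernel, the analogous grid would also include `svc__degree`, since a lower degree gives a less flexible decision boundary.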

Look at a classification report and confusion matrix. How does the class balance affect your results?

In [10]:
# answer below:
# Poly training:
from sklearn.metrics import classification_report, confusion_matrix

y_pred_train = svm_poly.predict(X_train_scaled)
confusion = confusion_matrix(y_train, y_pred_train)
print(confusion)
report = classification_report(y_train, y_pred_train)
print(report)

[[2479    1]
 [ 369   75]]
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      2480
           1       0.99      0.17      0.29       444

    accuracy                           0.87      2924
   macro avg       0.93      0.58      0.61      2924
weighted avg       0.89      0.87      0.83      2924



In [11]:
# Poly_testing:
y_pred_test = svm_poly.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_pred_test)
print(confusion)
report = classification_report(y_test, y_pred_test)
print(report)

[[608  11]
 [106   7]]
              precision    recall  f1-score   support

           0       0.85      0.98      0.91       619
           1       0.39      0.06      0.11       113

    accuracy                           0.84       732
   macro avg       0.62      0.52      0.51       732
weighted avg       0.78      0.84      0.79       732



In [12]:
# RBF training:
y_pred_train = svm_rbf.predict(X_train_scaled)
confusion = confusion_matrix(y_train, y_pred_train)
print(confusion)
report = classification_report(y_train, y_pred_train)
print(report)

[[2480    0]
 [ 405   39]]
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      2480
           1       1.00      0.09      0.16       444

    accuracy                           0.86      2924
   macro avg       0.93      0.54      0.54      2924
weighted avg       0.88      0.86      0.81      2924



In [13]:
# RBF testing:
y_pred_test = svm_rbf.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_pred_test)
print(confusion)
report = classification_report(y_test, y_pred_test)
print(report)

[[618   1]
 [112   1]]
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       619
           1       0.50      0.01      0.02       113

    accuracy                           0.85       732
   macro avg       0.67      0.50      0.47       732
weighted avg       0.79      0.85      0.78       732
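The reports show the effect of class imbalance. The test set has 619 negatives and 113 positives, so predicting class 0 for every row would already score 619/732 ≈ 0.846, which is the RBF model's test accuracy, while its recall on class 1 is only 0.01. One option worth trying is `class_weight="balanced"`, which penalizes errors on the minority class more heavily. The sketch below compares minority-class recall with and without it on synthetic, imbalanced data; the dataset and all variable names are assumptions for illustration, and the effect on the real data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly 15% positives, similar to the imbalance above.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scale using training-set statistics only.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Fit the same kernel with and without class weighting.
recalls = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = SVC(kernel="rbf", class_weight=class_weight).fit(X_tr_s, y_tr)
    recalls[label] = recall_score(y_te, model.predict(X_te_s))
    print(label, "minority recall:", round(recalls[label], 3))
```

Weighting usually trades some overall accuracy for better minority recall, so the classification report is a better guide than `score` when judging the result.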

