## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will fit them to the heart disease dataset loaded below and analyze the resulting models.

In [1]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt

In [2]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [3]:
heart.shape

(4238, 16)

In [4]:
heart.head()

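|   | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |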


This dataset helps us predict the probability of coronary heart disease (CHD) in the next 10 years given the risk factors for each subject in the study. Our target variable is `TenYearCHD`.

We'll start off by listing the binary and numeric feature columns, then remove any rows containing missing data.

In [5]:
bin_cols = [
    'male',
    'currentSmoker',
    'BPMeds',
    'prevalentStroke',
    'prevalentHyp',
    'diabetes'
]

num_cols = [
    'age',
    'education',
    'cigsPerDay',
    'totChol',
    'sysBP',
    'diaBP',
    'BMI',
    'heartRate',
    'glucose'
]

In [6]:
heart = heart.dropna()
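
Before dropping rows, it can help to check how much data is missing and how many rows the drop removes. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the heart data); the same calls should apply to `heart`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the heart data (values are made up).
toy = pd.DataFrame({
    'age': [39, 46, 48, 61],
    'glucose': [77.0, np.nan, 70.0, 103.0],
    'BPMeds': [0.0, 0.0, np.nan, 0.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()
rows_dropped = len(toy) - len(toy_clean)
print(f'Dropped {rows_dropped} of {len(toy)} rows')
```

If a large share of rows disappears, imputing the missing values may be worth considering instead of dropping them.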

Then we separate the features from the target and split the data into train and test subsets, with 20% of the data in the test subset.

In [7]:
X = heart.drop(columns='TenYearCHD')
y = heart['TenYearCHD']

In [8]:
# answer below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
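
The `stratify=y` argument keeps the class proportions the same in both subsets, which matters when one class is rare. A small sketch with hypothetical labels (the 850/150 split below is an assumption chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 850 negatives, 150 positives.
y_toy = np.array([0] * 850 + [1] * 150)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy)

# With stratify, both subsets keep the 15% positive rate.
print(f'Train positive rate: {y_tr.mean():.3f}')
print(f'Test positive rate:  {y_te.mean():.3f}')
```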




We will then scale the numeric columns using `StandardScaler`, leaving the binary columns unchanged. Do this in the cell below.

In [17]:
# answer below:
preprocessing = ColumnTransformer(
    [('scale', StandardScaler(), num_cols)], remainder='passthrough')
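
To see what this transformer produces, here is a minimal sketch on three columns taken from the first rows shown by `heart.head()` (the `toy` frame and the `ct` name are illustrative assumptions). Note that the output is a NumPy array in which the scaled columns come first and the passthrough columns follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# A few values from the first rows of the dataset, for illustration only.
toy = pd.DataFrame({
    'male': [1, 0, 1, 0],
    'age': [39, 46, 48, 61],
    'sysBP': [106.0, 121.0, 127.5, 150.0],
})

ct = ColumnTransformer(
    [('scale', StandardScaler(), ['age', 'sysBP'])], remainder='passthrough')
out = ct.fit_transform(toy)

# Column order in the output: age (scaled), sysBP (scaled), male (unchanged).
print(out.round(2))
```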


In [18]:
pipeline_rbf = Pipeline(
    [('preprocessing', preprocessing), ('svm', SVC(kernel='rbf', C=10))])

In [19]:
pipeline_poly = Pipeline(
    [('preprocessing', preprocessing), ('svm', SVC(kernel='poly', C=10))])


In [20]:
pipeline_rbf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scale',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['age', 'education',
                                                   'cigsPerDay', 'totChol',
                                                   'sysBP', 'diaBP', 'BMI',
                                                   'heartRate', 'glucose'])],
                                   verbose=False)),
                ('svm',
                 SVC(C=10, break_ties=False, cache_size=200, class_weight=None,

In [23]:
pipeline_rbf.fit(X_train, y_train)

print(f'Train score {pipeline_rbf.score(X_train, y_train)}')
print(f'Test score {pipeline_rbf.score(X_test, y_test)}\n')

Train score 0.9004787961696307
Test score 0.8497267759562842



In [24]:
pipeline_poly.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scale',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['age', 'education',
                                                   'cigsPerDay', 'totChol',
                                                   'sysBP', 'diaBP', 'BMI',
                                                   'heartRate', 'glucose'])],
                                   verbose=False)),
                ('svm',
                 SVC(C=10, break_ties=False, cache_size=200, class_weight=None,

In [25]:
pipeline_poly.fit(X_train, y_train)

print(f'Train score {pipeline_poly.score(X_train, y_train)}')
print(f'Test score {pipeline_poly.score(X_test, y_test)}\n')

Train score 0.8785909712722298
Test score 0.8524590163934426



Generate a polynomial SVC model and an RBF SVC model. Compare the performance and the runtime of the two models.
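
One way to compare runtime is to time each `fit` call with `time.perf_counter`. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart data, so the numbers it prints are not the assignment's results; the `results` dictionary and variable names are assumptions for illustration. In a notebook, the `%%time` cell magic is a common alternative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (roughly 15% positives).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=42, stratify=y_syn)

results = {}
for kernel in ['rbf', 'poly']:
    model = Pipeline([('scale', StandardScaler()),
                      ('svm', SVC(kernel=kernel, C=10))])
    # Time only the fit step.
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    results[kernel] = {
        'fit_seconds': fit_seconds,
        'train_score': model.score(X_tr, y_tr),
        'test_score': model.score(X_te, y_te),
    }

for kernel, r in results.items():
    print(f"{kernel}: fit {r['fit_seconds']:.3f}s, "
          f"train {r['train_score']:.3f}, test {r['test_score']:.3f}")
```

A single timing is noisy, so repeating the fit a few times and averaging gives a fairer comparison.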

Which model overfits more? How would you reduce the overfitting?
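
A common way to reduce overfitting in an SVM is to tune the regularization strength `C` (and `gamma` for the RBF kernel) with cross-validation. The sketch below uses `GridSearchCV` on synthetic data; the grid values are illustrative assumptions, not recommended settings for the heart data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=42, stratify=y_syn)

pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC(kernel='rbf'))])

# Illustrative grid: smaller C and smaller gamma give a smoother boundary.
# Pipeline parameters are addressed as '<step name>__<parameter>'.
param_grid = {'svm__C': [0.1, 1, 10], 'svm__gamma': ['scale', 0.01, 0.1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f'Train score {search.score(X_tr, y_tr):.3f}')
print(f'Test score {search.score(X_te, y_te):.3f}')
```

A smaller gap between the train and test scores after tuning suggests less overfitting.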

Look at a classification report and confusion matrix. How does the class balance affect your results?

In [26]:
y_pred = pipeline_rbf.predict(X_test)

In [27]:
# answer below:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.86      0.98      0.92       620
           1       0.54      0.13      0.21       112

    accuracy                           0.85       732
   macro avg       0.70      0.56      0.57       732
weighted avg       0.81      0.85      0.81       732
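
The report above shows a recall of only 0.13 for class 1, which has 112 of the 732 test samples, so the 0.85 accuracy mostly reflects the majority class. A confusion matrix makes this visible, and `class_weight='balanced'` is one option that often trades some precision for higher minority-class recall. The sketch below uses synthetic data, so its output is not the assignment's result; the `recalls` dictionary is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the heart data.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=42, stratify=y_syn)

recalls = {}
for class_weight in [None, 'balanced']:
    model = Pipeline([
        ('scale', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=10, class_weight=class_weight)),
    ])
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_te, pred)
    recalls[class_weight] = recall_score(y_te, pred)
    print(f'class_weight={class_weight}')
    print(cm)
    print(f'Recall for class 1: {recalls[class_weight]:.2f}\n')
```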

