## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will use the heart disease dataset loaded below and analyze the models fit to it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.multiclass import OneVsRestClassifier

In [2]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [3]:
heart.shape

(4238, 16)

In [4]:
heart.head()

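| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |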


This dataset lets us predict the 10-year risk of coronary heart disease (CHD) from each subject's risk factors. Our target variable is `TenYearCHD`.

We'll start off by removing any rows containing missing data.

In [5]:
heart = heart.dropna(axis=0)
heart.isna().mean()

male               0.0
age                0.0
education          0.0
currentSmoker      0.0
cigsPerDay         0.0
BPMeds             0.0
prevalentStroke    0.0
prevalentHyp       0.0
diabetes           0.0
totChol            0.0
sysBP              0.0
diaBP              0.0
BMI                0.0
heartRate          0.0
glucose            0.0
TenYearCHD         0.0
dtype: float64
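Before dropping rows, it can be useful to see how much data the drop costs. The sketch below uses a small made-up frame (`toy`, with invented values, not the real heart data) to show the share of missing values per column and the number of rows lost by `dropna`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the heart data; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "glucose": [77.0, np.nan, 70.0, 103.0, np.nan],
    "BMI": [26.97, 28.73, np.nan, 28.58, 23.10],
})

# Fraction of missing values per column before dropping anything.
missing_share = toy.isna().mean()

# Drop every row that has at least one missing value.
clean = toy.dropna(axis=0)

rows_lost = len(toy) - len(clean)
print(missing_share)
print(f"rows before: {len(toy)}, after: {len(clean)}, lost: {rows_lost}")
```

Running the same two checks on `heart` before reassigning it would show how many of the 4238 rows survive the drop.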

Next, we drop `currentSmoker`, one-hot encode `education`, and split the data into train and test sets, holding out 20% of the rows for testing.

In [6]:
X = heart.drop(columns=['currentSmoker', 'TenYearCHD'])
X = pd.get_dummies(X, columns=['education'], drop_first=True)
y = heart['TenYearCHD']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=36, stratify=y
)
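The `stratify=y` argument keeps the class balance the same in both subsets, which matters when the positive class is rare. A minimal sketch on synthetic labels (`X_demo` and `y_demo` are hypothetical, with an assumed 15% positive rate) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (15% positives); not the real heart data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 150 + [0] * 850)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=36, stratify=y_demo
)

# With stratify, both subsets keep roughly the original positive rate.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```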

Next, we scale the features with `StandardScaler`. Do this in the cell below.

In [7]:
scale = StandardScaler()
# Fit the scaler on the training data only, then reuse its statistics on the test set
X_train_sc = scale.fit_transform(X_train)
X_test_sc = scale.transform(X_test)
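The usual convention is to learn the scaling statistics from the training set only and apply them unchanged to the test set, so that no test-set information leaks into preprocessing. The sketch below, on made-up arrays (`train_demo` and `test_demo` are hypothetical), shows that `transform` reuses the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for X_train / X_test.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
train_sc = scaler.fit_transform(train_demo)  # learn mean/std from train only
test_sc = scaler.transform(test_demo)        # reuse the train statistics

# Train columns are centered at 0; test columns are only close to 0,
# because they were shifted by the training mean, not their own.
print(train_sc.mean(axis=0), test_sc.mean(axis=0))
```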


Generate a polynomial SVC model and an RBF SVC model. Compare the train and test scores of the two models, and also compare their runtimes.

In [8]:
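# Note: this model is fit on the unscaled X_train, not on X_train_sc from the previous cell
model2 = SVC(kernel='rbf')
model2.fit(X_train, y_train)
# Show train accuracy, then test accuracy
display(
    model2.score(X_train, y_train),
    model2.score(X_test, y_test),
)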



1.0

0.8469945355191257

In [None]:
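# Fit one polynomial-kernel SVC per degree (also on the unscaled features)
degrees = [1, 3, 5]
for deg in degrees:
    # TenYearCHD is binary, so the OneVsRestClassifier wrapper is not strictly needed
    model1 = OneVsRestClassifier(SVC(kernel='poly', degree=deg), n_jobs=-1)
    model1.fit(X_train, y_train)
    # Show a label, then train accuracy, then test accuracy
    display(
        f'polynomial of {deg} degrees score:',
        model1.score(X_train, y_train),
        model1.score(X_test, y_test),
    )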

'polynomial of 1 degrees score:'

0.8478112175102599

0.8469945355191257
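The prompt also asks for a runtime comparison. One way to collect it is to wrap each `fit` call in `time.perf_counter()`. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, not the heart dataset), with scaling placed inside a pipeline; the names `X_syn`, `y_syn`, and `results` are hypothetical, and the timings will differ from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced binary problem standing in for the heart data.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=36, stratify=y_syn
)

candidates = [
    ("rbf", SVC(kernel="rbf")),
    ("poly-3", SVC(kernel="poly", degree=3)),
]

results = {}
for name, svc in candidates:
    # Scaling lives inside the pipeline, so it is fit on the training data only.
    model = make_pipeline(StandardScaler(), svc)
    start = time.perf_counter()
    model.fit(X_a, y_a)
    elapsed = time.perf_counter() - start
    results[name] = {
        "train": model.score(X_a, y_a),
        "test": model.score(X_b, y_b),
        "fit_seconds": elapsed,
    }

for name, row in results.items():
    print(name, row)
```

Swapping the synthetic arrays for `X_train`, `X_test`, `y_train`, and `y_test` would give the comparison the assignment asks for.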

Which model overfits more? Why? How would you improve the model that overfits more?

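<span style="color:blue">The RBF model overfits far more: its train score is 1.0 against a test score of about 0.85, a gap of about 0.15, whereas the degree-1 polynomial model scores about 0.85 on both sets. Lowering the hyperparameter `C` (stronger regularization), for example by iterating over several values, would be the first step to correct this.</span>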
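Iterating over `C` can be automated with `GridSearchCV`, which is already imported at the top of the notebook. The following is a minimal sketch on synthetic data, not the heart dataset; the grid values for `C` and `gamma`, the 5-fold setting, and the names `pipe`, `param_grid`, and `search` are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem standing in for the heart data.
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=36, stratify=y_syn
)

# Scale inside the pipeline so each CV fold is scaled with its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Smaller C means stronger regularization; gamma controls the RBF kernel width.
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_a, y_a)

print(search.best_params_)
print(search.score(X_a, y_a), search.score(X_b, y_b))
```

A smaller gap between the two printed scores than the one seen above would suggest the tuned model overfits less.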