## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will use the heart disease dataset loaded below and analyze the models fitted to it.

In [12]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [3]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [4]:
heart.shape

(4238, 16)

In [5]:
heart.head()

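|   | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |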


This dataset lets us predict the probability of coronary heart disease (CHD) within the next 10 years, given the risk factors recorded for each subject in the study. Our target variable is `TenYearCHD`.

We'll start off by removing any rows containing missing data.

In [6]:
def missingness_summary(df, print_log=False, sort='none'):
    """Return the fraction of missing values in each column of df."""
    # isna() gives a boolean frame; its column-wise mean is the missing fraction
    summary = df.isna().mean()

    if sort == 'ascending':
        summary = summary.sort_values()
    elif sort == 'descending':
        summary = summary.sort_values(ascending=False)
    elif sort != 'none':
        # Fail loudly instead of printing a message and continuing
        raise ValueError("sort must be 'none', 'ascending', or 'descending'.")

    if print_log:
        print(summary)

    return summary

In [7]:
# answer below:
missingness_summary(heart)


male               0.000000
age                0.000000
education          0.024776
currentSmoker      0.000000
cigsPerDay         0.006843
BPMeds             0.012506
prevalentStroke    0.000000
prevalentHyp       0.000000
diabetes           0.000000
totChol            0.011798
sysBP              0.000000
diaBP              0.000000
BMI                0.004483
heartRate          0.000236
glucose            0.091553
TenYearCHD         0.000000
dtype: float64

In [20]:
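# Note: this drops every column that contains missing values, rather than the rows
# the prompt asks for; heart.dropna() would remove the incomplete rows instead.
data = heart.drop(['education', 'cigsPerDay', 'BPMeds', 'totChol', 'BMI', 'heartRate', 'glucose'], axis=1)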
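The difference between dropping incomplete rows and dropping incomplete columns can be seen on a small example. This is a minimal sketch on a hypothetical toy frame (the `toy` name and its values are made up for illustration, not taken from the Framingham data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `heart`; the values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "BMI": [26.97, 28.73, np.nan, 28.58],
    "TenYearCHD": [0, 0, 0, 1],
})

# axis=0 keeps every column and removes rows with at least one missing value
rows_dropped = toy.dropna(axis=0)

# axis=1 keeps every row and removes columns with at least one missing value
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Dropping rows keeps all predictors at the cost of some observations, while dropping columns keeps all observations but discards predictors such as `glucose` entirely.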

Then, we split the data into train and test with 20% of the data in the test subset.

In [21]:
# answer below:
X = data.drop('TenYearCHD', axis=1)
y = data['TenYearCHD']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
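Because the target is imbalanced, one option worth considering is a stratified split with a fixed seed, so that both subsets keep the same class ratio and the split is reproducible. The sketch below assumes synthetic data (`X_demo`, `y_demo` are hypothetical names) rather than the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 15% positive class (an assumption for the demo)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify keeps the class ratio equal in both subsets; random_state makes it repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```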




We will then scale the data using the standard scaler. Do this in the cell below.

In [34]:
# answer below:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
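To make explicit what the scaler does, the following sketch checks that transforming test data with a scaler fitted on the training data is the same as subtracting the training mean and dividing by the training standard deviation. The arrays here are randomly generated and purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative random data, not the heart dataset
rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(200, 2))
test = rng.normal(loc=50, scale=10, size=(50, 2))

# Fit on the training data only, then reuse those statistics on the test data
scaler_demo = StandardScaler().fit(train)
test_scaled = scaler_demo.transform(test)

# Manual equivalent: StandardScaler uses the population std (ddof=0)
mu = train.mean(axis=0)
sigma = train.std(axis=0)
manual = (test - mu) / sigma

print(np.allclose(test_scaled, manual))
```

Fitting on the training subset only keeps information about the test set from leaking into preprocessing.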


Generate a polynomial SVC model and a RBF SVC model. Compare the performance, and the runtime, for the two models.

In [35]:
# answer below:

import timeit

start = timeit.default_timer()

poly = SVC(kernel='poly')
poly_model = poly.fit(X_train_scaled, y_train)

stop = timeit.default_timer()

print('Poly Time: ', stop - start)


start = timeit.default_timer()

rbf = SVC(kernel='rbf')
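# Fit the RBF estimator itself. Calling poly.fit here instead makes rbf_model the
# polynomial model again, which is why the recorded outputs are identical for both.
rbf_model = rbf.fit(X_train_scaled, y_train)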

stop = timeit.default_timer()

print('SVC Time: ', stop - start)



Poly Time:  0.9950518040000134
SVC Time:  1.0131561059997694
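One way to avoid mixing up the two estimators while timing them is to loop over the kernel names and store each fitted model under its own key. This is a sketch on synthetic data from `make_classification` (an assumption for the demo); the `fit_times` and `models` dictionaries are hypothetical names.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)

fit_times = {}
models = {}
for kernel in ("poly", "rbf"):
    model = SVC(kernel=kernel)
    t0 = time.perf_counter()
    model.fit(X_syn, y_syn)
    fit_times[kernel] = time.perf_counter() - t0
    models[kernel] = model  # each kernel keeps its own fitted estimator

for kernel, seconds in fit_times.items():
    print(f"{kernel}: {seconds:.4f} s")
```

A single timed fit is noisy, so repeating the measurement several times would give a fairer comparison.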


In [48]:
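# Note: the polynomial model is scored on the training set and the RBF model on the
# test set, so these two numbers are not directly comparable.
poly_score = poly_model.score(X_train_scaled, y_train)
print('Polynomial Score:', poly_score)

rbf_score = rbf_model.score(X_test_scaled, y_test)
print('RBF Score:', rbf_score)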

Polynomial Score: 0.8545722713864307
RBF Score: 0.8525943396226415
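To judge overfitting, each model needs both a training score and a test score; the gap between them is the quantity of interest. The sketch below shows that comparison on synthetic data (an assumption, as are the variable names), not on the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in data
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

gaps = {}
for kernel in ("poly", "rbf"):
    model = SVC(kernel=kernel).fit(X_tr_s, y_tr)
    train_acc = model.score(X_tr_s, y_tr)
    test_acc = model.score(X_te_s, y_te)
    gaps[kernel] = train_acc - test_acc  # larger gap suggests more overfitting
    print(f"{kernel}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[kernel]:.3f}")
```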


Which model overfits more? How would you reduce the overfitting?
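One common way to reduce overfitting in an SVC is to lower the regularization parameter `C` (smaller values regularize more strongly) and pick the value by cross-validation. The following is a sketch under stated assumptions: synthetic data, an arbitrary grid of four `C` values, and 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)

cv_means = {}
for C in (0.01, 0.1, 1.0, 10.0):
    # The pipeline refits the scaler inside each fold, avoiding leakage
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    scores = cross_val_score(pipe, X_syn, y_syn, cv=5)
    cv_means[C] = scores.mean()
    print(f"C={C}: mean CV accuracy={cv_means[C]:.3f}")

best_C = max(cv_means, key=cv_means.get)
print("best C:", best_C)
```

The same loop could also vary `gamma` for the RBF kernel or `degree` for the polynomial kernel.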

Look at a classification report and confusion matrix. How does the class balance affect your results?

In [54]:
# answer below:
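# Mean signed error on the test set (predicted minus actual). A negative value means
# the model predicts class 1 less often than it occurs; it does not measure overfitting.
rbf_overfitting = rbf_model.predict(X_test_scaled) - y_test
poly_overfitting = poly_model.predict(X_test_scaled) - y_test

print('rbf_overfitting:', rbf_overfitting.mean())
print('poly_overfitting', poly_overfitting.mean())

# classification_report expects (y_true, y_pred). The recorded report below was produced
# with the arguments reversed, so its precision and recall columns are swapped and its
# support column counts predictions.
print(classification_report(y_test, rbf_model.predict(X_test_scaled)))
print(classification_report(y_test, poly_model.predict(X_test_scaled)))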

rbf_overfitting: -0.13561320754716982
poly_overfitting -0.13561320754716982
              precision    recall  f1-score   support

           0       0.99      0.86      0.92       838
           1       0.04      0.50      0.07        10

    accuracy                           0.85       848
   macro avg       0.52      0.68      0.50       848
weighted avg       0.98      0.85      0.91       848

              precision    recall  f1-score   support

           0       0.99      0.86      0.92       838
           1       0.04      0.50      0.07        10

    accuracy                           0.85       848
   macro avg       0.52      0.68      0.50       848
weighted avg       0.98      0.85      0.91       848
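The prompt also asks for a confusion matrix. The recorded report lists supports of 838 and 10, which are counts of predicted labels, since the report was generated with the predictions in the `y_true` position. Combined with the mean signed error of about -0.1356 over 848 test rows (115 fewer predicted positives than actual ones), this implies roughly 125 actual positives, 10 predicted positives, and 5 true positives. The sketch below rebuilds labels from those inferred counts; the counts are an assumption derived from the printed output, not a rerun of the model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts inferred from the recorded output (assumption, see the paragraph above)
tn, fp, fn, tp = 718, 5, 120, 5

# Rebuild label vectors that reproduce those counts
y_true_demo = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)
print("precision (class 1):", precision)
print("recall (class 1):", recall)
```

Read this way, the model finds only a small fraction of the actual CHD cases, even though overall accuracy is about 0.85.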



The classes are heavily imbalanced towards `TenYearCHD` = 0, so the model predicts the majority class almost exclusively (only 10 of the 848 test predictions are 1) and has little predictive power for the positive class.
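One possible response to the imbalance is to reweight the classes with `class_weight='balanced'`, which penalizes errors on the minority class more heavily. The sketch below compares minority-class recall with and without reweighting on synthetic imbalanced data (an assumption for the demo); whether it helps on the heart data would need to be checked separately.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly 15% positives, standing in for the CHD target
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

recalls = {}
for cw in (None, "balanced"):
    model = SVC(kernel="rbf", class_weight=cw).fit(X_tr_s, y_tr)
    # Recall on class 1: share of actual positives that the model finds
    recalls[str(cw)] = recall_score(y_te, model.predict(X_te_s))
    print(f"class_weight={cw}: recall for class 1 = {recalls[str(cw)]:.3f}")
```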
