## Day 33 Lecture 2 Assignment

In this assignment, we will learn about nonlinear SVM models. We will load the heart disease dataset below and analyze the models fitted to it.

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [3]:
heart = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/framingham_heart_disease.csv')

In [4]:
heart.shape

(4238, 16)

In [5]:
heart.head()

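|   | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |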


This dataset lets us predict the 10-year risk of coronary heart disease (CHD) from the risk factors recorded for each subject in the study. Our target variable is `TenYearCHD`.

We'll start off by removing any rows containing missing data.

In [10]:
# answer below:
# Percentage of missing values in each column
heart.isnull().sum()*100/heart.isnull().count()


male               0.000000
age                0.000000
education          2.477584
currentSmoker      0.000000
cigsPerDay         0.684285
BPMeds             1.250590
prevalentStroke    0.000000
prevalentHyp       0.000000
diabetes           0.000000
totChol            1.179802
sysBP              0.000000
diaBP              0.000000
BMI                0.448325
heartRate          0.023596
glucose            9.155262
TenYearCHD         0.000000
dtype: float64

In [11]:
heart.dropna(inplace=True)

In [13]:
#heart.isnull().sum()*100/heart.isnull().count()
heart.shape


(3656, 16)

Next, we split the data into training and test sets, holding out 20% of the rows for the test set.

In [14]:
# answer below:
from sklearn.model_selection import train_test_split

X = heart.drop(columns='TenYearCHD')
y = heart['TenYearCHD']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
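
The split above is random and unseeded, so the scores reported later will vary from run to run. Since only about 15% of subjects are positive for `TenYearCHD`, a stratified, seeded split may be worth considering. The following is a minimal sketch on synthetic data, not the split used in this notebook; `X_demo`, `y_demo`, and the seed value are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (assumption): 1000 rows, 15% positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify keeps the class ratio the same in both subsets;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print('train positive rate:', y_tr.mean())
print('test positive rate: ', y_te.mean())
```

With stratification, the positive rate in the test subset matches the training subset, so test metrics for the rare class are less affected by an unlucky split.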



Next, we scale the features with the standard scaler, fitting it on the training set only and applying the same transformation to the test set. Do this in the cell below.

In [44]:
# answer below:
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler()
X_train_sca = scaled.fit_transform(X_train)
X_test_sca = scaled.transform(X_test)
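
The fit-on-train, transform-on-test pattern above can also be bundled into a scikit-learn pipeline, which makes it harder to leak test statistics into the scaler by accident. This is a hedged sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `model` are illustrative and not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the heart features (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training data only, then reuses
# the learned mean and variance when scoring the test data.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale'))
model.fit(X_tr, y_tr)

print('test accuracy:', model.score(X_te, y_te))
```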


Generate a polynomial SVC model and an RBF SVC model. Compare the performance and the runtime of the two models.

In [73]:
# answer below:
from sklearn.svm import SVC
from datetime import datetime

# The timed span covers fitting plus scoring on both the training and test sets.
start_time = datetime.now()

poly = SVC(C=0.5, kernel='poly', gamma='auto')
poly.fit(X_train_sca, y_train)
train_sco = poly.score(X_train_sca, y_train)
test_sco = poly.score(X_test_sca, y_test)

end_time = datetime.now()

print('Duration: {}'.format(end_time - start_time))
print('Poly Train Score: ', train_sco)
print('Poly Test Score: ', test_sco)
print('')

start_time1 = datetime.now()
rbf = SVC(kernel='rbf', gamma='scale')
rbf.fit(X_train_sca, y_train)
train_sco1 = rbf.score(X_train_sca, y_train)
test_sco1 = rbf.score(X_test_sca, y_test)

end_time1 = datetime.now()
 
print('Duration: {}'.format(end_time1 - start_time1))
print('RBF Train Score: ', train_sco1)
print('RBF Test Score: ', test_sco1)

Duration: 0:00:00.294511
Poly Train Score:  0.8662790697674418
Poly Test Score:  0.8524590163934426

Duration: 0:00:00.481772
RBF Train Score:  0.8580711354309165
RBF Test Score:  0.8524590163934426
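
Note that the durations above come from a single run and include both fitting and scoring, and that the two models use different `C` and `gamma` settings, so the comparison is only indicative. If a tighter measurement is wanted, one option is a small helper built on `time.perf_counter`, which is a monotonic clock intended for timing. The helper name `timed` is hypothetical.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage with any callable; here a trivial stand-in for model.fit.
result, elapsed = timed(sum, [1, 2, 3])
print(result, f'{elapsed:.6f} s')
```

In the notebook, a call such as `timed(poly.fit, X_train_sca, y_train)` would time the fit alone, separately from scoring.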


Which model overfits more? How would you reduce the overfitting?
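
One simple way to approach the first question is to compare the gap between training and test accuracy for each model. The sketch below plugs in the scores printed above; the dictionary layout is just an illustration.

```python
# Accuracy values copied from the printed output above.
scores = {
    'poly': {'train': 0.8662790697674418, 'test': 0.8524590163934426},
    'rbf': {'train': 0.8580711354309165, 'test': 0.8524590163934426},
}

# A larger train-minus-test gap suggests more overfitting.
gaps = {name: s['train'] - s['test'] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f'{name}: train - test = {gap:.4f}')
```

On this particular split, the polynomial model shows the slightly larger gap (about 0.014 versus about 0.006), although both gaps are small. Typical ways to reduce overfitting in an SVC include lowering `C` for stronger regularization, lowering the polynomial `degree`, or choosing these values by cross-validation.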

Look at a classification report and confusion matrix. How does the class balance affect your results?
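
One common way to probe the effect of class balance is to refit with `class_weight='balanced'`, which scales the penalty for each class inversely to its frequency. The sketch below uses synthetic imbalanced data from `make_classification` rather than the heart dataset, so its numbers say nothing about this assignment's results; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly 15% positives (assumption, not the heart data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15],
    flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Same kernel, with and without class reweighting.
plain = SVC(kernel='rbf', gamma='scale').fit(X_tr_s, y_tr)
balanced = SVC(kernel='rbf', gamma='scale', class_weight='balanced').fit(X_tr_s, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te_s))
recall_balanced = recall_score(y_te, balanced.predict(X_te_s))
print('positive-class recall, unweighted:', recall_plain)
print('positive-class recall, balanced:  ', recall_balanced)
```

Reweighting usually trades some overall accuracy and precision for higher recall on the minority class, so the classification report is a better guide than accuracy alone.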

In [74]:
# answer below:
from sklearn.metrics import classification_report, confusion_matrix


y_pred_train = poly.predict(X_train_sca)
y_pred_test = poly.predict(X_test_sca)

conf_train = confusion_matrix(y_train, y_pred_train)
conf_test = confusion_matrix(y_test, y_pred_test)

print(conf_train)
print('')
print(conf_test)
print('')
train_repo = classification_report(y_train, y_pred_train)
test_repo = classification_report(y_test, y_pred_test)
print(train_repo)
print('')
print(test_repo)


[[2474    1]
 [ 390   59]]

[[617   7]
 [101   7]]

              precision    recall  f1-score   support

           0       0.86      1.00      0.93      2475
           1       0.98      0.13      0.23       449

    accuracy                           0.87      2924
   macro avg       0.92      0.57      0.58      2924
weighted avg       0.88      0.87      0.82      2924


              precision    recall  f1-score   support

           0       0.86      0.99      0.92       624
           1       0.50      0.06      0.11       108

    accuracy                           0.85       732
   macro avg       0.68      0.53      0.52       732
weighted avg       0.81      0.85      0.80       732



In [75]:
y_pred_train1 = rbf.predict(X_train_sca)
y_pred_test1 = rbf.predict(X_test_sca)

conf_train1 = confusion_matrix(y_train, y_pred_train1)
conf_test1 = confusion_matrix(y_test, y_pred_test1)

print(conf_train1)
print('')
print(conf_test1)
print('')
train_repo1 = classification_report(y_train, y_pred_train1)
test_repo1 = classification_report(y_test, y_pred_test1)
print(train_repo1)
print('')
print(test_repo1)

[[2475    0]
 [ 415   34]]

[[623   1]
 [107   1]]

              precision    recall  f1-score   support

           0       0.86      1.00      0.92      2475
           1       1.00      0.08      0.14       449

    accuracy                           0.86      2924
   macro avg       0.93      0.54      0.53      2924
weighted avg       0.88      0.86      0.80      2924


              precision    recall  f1-score   support

           0       0.85      1.00      0.92       624
           1       0.50      0.01      0.02       108

    accuracy                           0.85       732
   macro avg       0.68      0.50      0.47       732
weighted avg       0.80      0.85      0.79       732
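
To see how class balance shapes these results, it helps to compare each model's test accuracy with the accuracy of always predicting the majority class. The sketch below reuses the two test confusion matrices printed above; the helper `summarize` is hypothetical.

```python
import numpy as np

# Test confusion matrices copied from the output above
# (rows are true classes, columns are predicted classes).
conf_test_poly = np.array([[617, 7], [101, 7]])
conf_test_rbf = np.array([[623, 1], [107, 1]])

def summarize(conf):
    """Return (accuracy, majority-class baseline, positive-class recall)."""
    total = conf.sum()
    accuracy = np.trace(conf) / total
    baseline = conf.sum(axis=1).max() / total  # always predict the larger class
    recall_pos = conf[1, 1] / conf[1].sum()
    return accuracy, baseline, recall_pos

for name, conf in [('poly', conf_test_poly), ('rbf', conf_test_rbf)]:
    acc, base, rec = summarize(conf)
    print(f'{name}: accuracy={acc:.4f} baseline={base:.4f} recall(1)={rec:.4f}')
```

On this split, both models reach exactly the majority-class baseline (624 of 732 correct), while recovering only 7 and 1 of the 108 positive cases, respectively. The high accuracy therefore mostly reflects the class imbalance rather than any ability to detect CHD.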

