Previously, in Lab 3, we learned how to use some ML models from the scikit-learn package on a regression task, together with some data preprocessing procedures. This week, we will review those preprocessing procedures and apply logistic regression and a support vector machine (SVM) to a classification task.

Task

This database was collected from the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation. It contains information on 303 patients, with 14 attributes (13 input variables and 1 target variable).

We are using this dataset to build a machine learning model that predicts whether a patient presents heart disease. The detailed information for each variable is as follows:
1. age: age in years
2. sex (male and female)
3. chest pain type
4. resting blood pressure (in mm Hg on admission to the hospital)
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl (true and false)
7. resting electrocardiographic results
<br>   -- Value 0: normal
<br>   -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
<br>   -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
<br>   -- Value 1: upsloping
<br>   -- Value 2: flat
<br>   -- Value 3: downsloping
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
<br>   -- Value 0: absence
<br>   -- Value 1: presence

More information about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

### Load the dataset
Use pandas to load the CSV file "heart_disease.csv" provided on LMS, then check the dataset length and print the first 5 rows of the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("./heart_disease.csv")
print(len(df))  # number of rows in the dataset
df.head()       # first 5 rows

### Preprocess the dataset
##### Check if there is any missing value in the dataset

In [2]:
df.isna().sum()

Age                                     0
Sex                                     0
Chest Pain Type                         0
Resting Blood Pressure                  0
Serum Cholestoral                       0
Fasting Blood Sugar                     0
Resting electrocardiographic results    0
Maximum heart rate achieved             0
Exercise induced angina                 0
ST depression                           0
the slope                               0
Number of major vessels                 4
thal                                    2
Diagnosis                               0
dtype: int64

##### Drop the rows that have missing values

In [3]:
df = df.dropna()
df.isna().sum()

Age                                     0
Sex                                     0
Chest Pain Type                         0
Resting Blood Pressure                  0
Serum Cholestoral                       0
Fasting Blood Sugar                     0
Resting electrocardiographic results    0
Maximum heart rate achieved             0
Exercise induced angina                 0
ST depression                           0
the slope                               0
Number of major vessels                 0
thal                                    0
Diagnosis                               0
dtype: int64
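
Before dropping rows, it can help to know how many rows will be lost, since one row may have several missing values. The following is a minimal sketch on a hypothetical toy frame (the column names and values are illustrative, not taken from heart_disease.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy = pd.DataFrame({
    "vessels": [0.0, np.nan, 2.0, 1.0],
    "thal": [3.0, 7.0, np.nan, 6.0],
    "label": [0, 1, 1, 0],
})

# Missing values per column (same call as in the cell above).
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()

print(missing_per_column)
print("rows before:", len(toy), "rows dropped:", rows_with_missing, "rows after:", len(cleaned))
```

Comparing `len(toy)` with `len(cleaned)` confirms that the number of dropped rows matches the number of rows with missing values.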

##### Check variable data types

In [4]:
df.dtypes

Age                                       int64
Sex                                      object
Chest Pain Type                          object
Resting Blood Pressure                    int64
Serum Cholestoral                         int64
Fasting Blood Sugar                        bool
Resting electrocardiographic results      int64
Maximum heart rate achieved               int64
Exercise induced angina                   int64
ST depression                           float64
the slope                                 int64
Number of major vessels                 float64
thal                                    float64
Diagnosis                                 int64
dtype: object

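We found that Number of major vessels and thal should be integers but are stored as floats. These are the two columns that contained missing values, and pandas stores a numeric column with missing values as float by default. Since the rows with missing values have been dropped, we can convert both columns to integer type.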

In [5]:
cols = ['Number of major vessels', 'thal']
df[cols] = df[cols].astype(int)

In [6]:
# check again
df.dtypes

Age                                       int64
Sex                                      object
Chest Pain Type                          object
Resting Blood Pressure                    int64
Serum Cholestoral                         int64
Fasting Blood Sugar                        bool
Resting electrocardiographic results      int64
Maximum heart rate achieved               int64
Exercise induced angina                   int64
ST depression                           float64
the slope                                 int64
Number of major vessels                   int64
thal                                      int64
Diagnosis                                 int64
dtype: object

We can see that these two variables are now properly converted.

##### Check if there are any duplicated rows in the dataset
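
The float type seen earlier is a side effect of missing values: NaN is a floating-point value, so pandas stores the whole column as float64. A small sketch on a hypothetical toy column (not the lab data) illustrates this:

```python
import numpy as np
import pandas as pd

# Hypothetical column of whole numbers with one missing entry.
toy = pd.DataFrame({"vessels": [0, 3, np.nan, 1]})

# The NaN forces pandas to store the column as float64.
dtype_with_nan = toy["vessels"].dtype

# After dropping the missing row, the values can be cast to integers.
cleaned = toy.dropna().copy()
cleaned["vessels"] = cleaned["vessels"].astype(int)
dtype_after = cleaned["vessels"].dtype

print(dtype_with_nan, "->", dtype_after)
```

Calling `astype(int)` before removing the missing values would raise an error, which is why the conversion comes after `dropna()`.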

In [7]:
df.duplicated().sum()

0

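##### Check value counts for the categorical variables
Note that `df.value_counts()` called on the whole DataFrame counts unique rows (combinations of all columns). To count the values of a single categorical variable, call `value_counts()` on that column, e.g. `df["Sex"].value_counts()`.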

In [8]:
df.value_counts()

Age  Sex   Chest Pain Type   Resting Blood Pressure  Serum Cholestoral  Fasting Blood Sugar  Resting electrocardiographic results  Maximum heart rate achieved  Exercise induced angina  ST depression  the slope  Number of major vessels  thal  Diagnosis
29   male  atypical angina   130                     204                False                2                                     202                          0                        0.0            1          0                        3     0            1
59   male  typical angina    170                     288                False                2                                     159                          0                        0.2            2          0                        7     1            1
                             134                     204                False                0                                     162                          0                        0.8            1          2                      
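
To inspect each categorical variable separately, `value_counts()` can be called column by column. The sketch below uses a hypothetical toy frame; the column names mirror the dataset, but the values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Chest Pain Type": ["asymptomatic", "typical angina", "asymptomatic", "non-anginal pain"],
})

categorical_cols = ["Sex", "Chest Pain Type"]

# One Series of counts per categorical column.
counts = {col: toy[col].value_counts() for col in categorical_cols}

for col, col_counts in counts.items():
    print(col_counts, "\n")
```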

##### Deal with categorical variables

Since both Sex and Fasting Blood Sugar are binary variables, we can use 0 and 1 to replace their values.

For example, for variable Sex:
<br> 1 = male; 0 = female

For variable Fasting Blood Sugar:
<br> 1 = True; 0 = False

In addition, based on a domain expert's advice, we can use the following rule to transform the categorical variable Chest Pain Type:
<br>-- Value 1: typical angina
<br>-- Value 2: atypical angina
<br>-- Value 3: non-anginal pain
<br>-- Value 4: asymptomatic

In [9]:
df["Sex"].unique()

array(['male', 'female'], dtype=object)

In [10]:
df["Fasting Blood Sugar"].unique()

array([ True, False])

In [11]:
df["Sex"] = df["Sex"].map({"male": 1, "female": 0})
df["Fasting Blood Sugar"] = df["Fasting Blood Sugar"].map({True: 1, False: 0})
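# Note: this assigns codes 0-3 in a different order from the 1-4 rule listed above;
# the outputs in the rest of this notebook were produced with this mapping.
df["Chest Pain Type"] = df["Chest Pain Type"].map({"typical angina": 0, "asymptomatic": 1, "non-anginal pain": 2, "atypical angina": 3})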
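
The rule listed above assigns the values 1 to 4, whereas the mapping in the cell above uses 0 to 3 in a different order. As a sketch only, the rule as stated could be applied as follows. It uses a hypothetical toy frame, and `CHEST_PAIN_CODES` is an illustrative name; the outputs later in this notebook were not produced with this mapping:

```python
import pandas as pd

# Codes following the rule listed in the text (illustrative constant name).
CHEST_PAIN_CODES = {
    "typical angina": 1,
    "atypical angina": 2,
    "non-anginal pain": 3,
    "asymptomatic": 4,
}

# Hypothetical toy column containing each label once.
toy = pd.DataFrame({
    "Chest Pain Type": ["asymptomatic", "typical angina", "non-anginal pain", "atypical angina"],
})

encoded = toy["Chest Pain Type"].map(CHEST_PAIN_CODES)

# map() produces NaN for labels missing from the dictionary, so verify coverage.
if encoded.isna().any():
    raise ValueError("Found chest pain labels that are not in CHEST_PAIN_CODES")

encoded = encoded.astype(int)
print(encoded.tolist())
```

Checking for NaN after `map()` catches misspelled or unexpected labels before they reach the model.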

##### Check dataset shape

In [12]:
df.shape

(297, 14)

##### Define the input variables and the target variable
The target variable is the last column, Diagnosis, and the input variables are the rest of the columns.

In [13]:
df.iloc[:, :-1]

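| | Age | Sex | Chest Pain Type | Resting Blood Pressure | Serum Cholestoral | Fasting Blood Sugar | Resting electrocardiographic results | Maximum heart rate achieved | Exercise induced angina | ST depression | the slope | Number of major vessels | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 0 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 |
| 1 | 67 | 1 | 1 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 |
| 2 | 67 | 1 | 1 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 |
| 3 | 37 | 1 | 2 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 |
| 4 | 41 | 0 | 3 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 297 | 57 | 0 | 1 | 140 | 241 | 0 | 0 | 123 | 1 | 0.2 | 2 | 0 | 7 |
| 298 | 45 | 1 | 0 | 110 | 264 | 0 | 0 | 132 | 0 | 1.2 | 2 | 0 | 7 |
| 299 | 68 | 1 | 1 | 144 | 193 | 1 | 0 | 141 | 0 | 3.4 | 2 | 2 | 7 |
| 300 | 57 | 1 | 1 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1 | 7 |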


In [14]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

### Split the dataset and normalize data

##### Split the training and testing dataset
Use 10% of the dataset for testing, with a random state of 123.

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
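
With 297 rows and `test_size=0.1`, scikit-learn rounds the test size up, giving 30 test rows and 267 training rows. The sketch below checks this on synthetic data of the same shape. The `stratify` argument is an optional addition that is not used in the cell above; it keeps the class proportions similar in both splits, which can matter for a test set this small:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the cleaned dataset (297 rows, 13 inputs).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(297, 13))
y_toy = np.array([0] * 150 + [1] * 147)  # made-up class balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=123, stratify=y_toy
)

print("train shape:", X_tr.shape, "test shape:", X_te.shape)
print("positive rate (all / train / test):", y_toy.mean(), y_tr.mean(), y_te.mean())
```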

##### Apply normalization on both the training and testing datasets

In [16]:
from sklearn.preprocessing import normalize

X_train = normalize(X_train)
X_test = normalize(X_test)
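
Note that `sklearn.preprocessing.normalize` rescales each sample (row) to unit L2 norm by default, so it does not put the features (columns) on a common scale. If per-feature scaling is what is wanted, one common alternative is `StandardScaler`, fitted on the training data only and then applied to both splits. The sketch below contrasts the two on hypothetical toy arrays; it is an alternative for comparison, not the method used in this lab:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy data: two features on different scales.
X_train_toy = np.array([[63.0, 233.0], [37.0, 250.0], [41.0, 204.0], [56.0, 236.0]])
X_test_toy = np.array([[57.0, 354.0], [45.0, 264.0]])

# Row-wise normalization: every sample ends up with L2 norm 1.
row_normalized = normalize(X_train_toy)
row_norms = np.linalg.norm(row_normalized, axis=1)

# Feature-wise standardization: statistics come from the training split only.
scaler = StandardScaler().fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print("row norms after normalize:", row_norms)
print("column means after scaling:", X_train_scaled.mean(axis=0))
print("column stds after scaling:", X_train_scaled.std(axis=0))
```

Fitting the scaler on the training split and reusing it for the test split avoids leaking test-set statistics into preprocessing.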

In [17]:
X_train

array([[0.16416104, 0.00278239, 0.        , ..., 0.00834717, 0.        ,
        0.01947673],
       [0.17301833, 0.00332728, 0.00332728, ..., 0.00332728, 0.00998183,
        0.02329093],
       [0.09833165, 0.00252132, 0.00504265, ..., 0.00252132, 0.        ,
        0.00756397],
       ...,
       [0.16644923, 0.00308239, 0.00308239, ..., 0.00308239, 0.        ,
        0.00924718],
       [0.22632717, 0.        , 0.00917543, ..., 0.00305848, 0.00305848,
        0.00917543],
       [0.15097326, 0.00314528, 0.00314528, ..., 0.00314528, 0.        ,
        0.00943583]])

In [18]:
X_test

array([[1.59893953e-01, 0.00000000e+00, 3.01686703e-03, 3.92192714e-01,
        7.96452896e-01, 0.00000000e+00, 6.03373406e-03, 4.31411985e-01,
        0.00000000e+00, 1.20674681e-03, 6.03373406e-03, 0.00000000e+00,
        9.05060109e-03],
       [1.47647826e-01, 0.00000000e+00, 2.63656832e-03, 5.27313663e-01,
        7.59331675e-01, 2.63656832e-03, 5.27313663e-03, 3.50663586e-01,
        2.63656832e-03, 1.05462733e-02, 7.90970495e-03, 5.27313663e-03,
        1.84559782e-02],
       [2.07679543e-01, 3.51999225e-03, 3.51999225e-03, 4.92798915e-01,
        6.23038628e-01, 0.00000000e+00, 0.00000000e+00, 5.70238745e-01,
        3.51999225e-03, 0.00000000e+00, 3.51999225e-03, 3.51999225e-03,
        2.46399458e-02],
       [2.10045190e-01, 0.00000000e+00, 3.62146879e-03, 4.70790942e-01,
        7.13429351e-01, 0.00000000e+00, 0.00000000e+00, 4.74412411e-01,
        0.00000000e+00, 2.17288127e-03, 7.24293758e-03, 0.00000000e+00,
        1.08644064e-02],
       [1.22143241e-01, 3.05358103e-

### Train logistic regression and SVM models for classification on the entire training dataset, then evaluate them on the testing dataset
Be aware that, for regression models, the default evaluation metric returned by `score` is R squared. For classification tasks, the default evaluation metric is accuracy.

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# logistic regression model, parameters can be changed
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Testing Accuracy of LR:", test_score)

# Support Vector Machine for classification, parameters can be changed
model = SVC()
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Testing Accuracy of SVC:", test_score)

Testing Accuracy of LR: 0.6
Testing Accuracy of SVC: 0.5666666666666667
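
For a classifier, `score` returns the same number as computing accuracy from the predictions. A small sketch on synthetic data (generated with `make_classification`, not the lab dataset) shows the equivalence:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# score() on a classifier is mean accuracy on the given data.
default_score = clf.score(X_te, y_te)

# The same value computed explicitly from predictions.
manual_accuracy = accuracy_score(y_te, clf.predict(X_te))

print(default_score, manual_accuracy)
```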


### Train a model with 5-fold cross validation

##### Define a 5-fold cross validation with data shuffling and set the random state to 2

In [20]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=2)
kf.get_n_splits(X)

5

##### Run the 5-fold cross validation and print the average accuracy score based on the cross validation results, and evaluate both models on the testing dataset

In [21]:
from sklearn.model_selection import cross_val_score

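# Note: `model` is the SVC fitted above, and this call cross-validates on the
# test split (30 samples, 6 per fold), not on the training split.
scores = cross_val_score(model, X_test, y_test, cv=kf)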

scores

array([0.66666667, 0.33333333, 0.16666667, 0.33333333, 0.16666667])
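
A more common arrangement is to run the cross validation on the training split, report the mean of the fold scores, and keep the test split for one final evaluation. The following is a minimal sketch of that workflow on synthetic data; the variable names and the data are illustrative, and the numbers it prints are unrelated to the results in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=2)

# Cross-validate on the training split only.
cv_scores = cross_val_score(SVC(), X_tr, y_tr, cv=kf_toy)
mean_cv = cv_scores.mean()

# Fit once on the full training split, then evaluate once on the held-out test split.
final_model = SVC().fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)

print("fold accuracies:", cv_scores)
print("mean CV accuracy:", mean_cv)
print("test accuracy:", test_acc)
```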

### Optimize the Logistic Regression model with cross validation
The parameters that can be used in the parameter grid are listed here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html You can add more parameters and values to grid_params_lr.

In [22]:
# fine tune parameters for lr model
from sklearn.model_selection import GridSearchCV

grid_params_lr = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10],
    'solver': ['saga', 'liblinear']
}

lr = LogisticRegression(max_iter=150)
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kf).fit(X_train, y_train)
print(gs_lr_result.best_score_)



0.7193570929419986


### Evaluate the trained Logistic Regression model using testing dataset

In [23]:
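# Note: cross_val_score refits clones of the best estimator on folds of the
# test split; gs_lr_result.score(X_test, y_test) would instead score the model
# already trained on the training split.
scores = cross_val_score(gs_lr_result.best_estimator_, X_test, y_test, cv=kf)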

scores

array([0.66666667, 0.5       , 0.5       , 0.5       , 0.5       ])
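
Because `GridSearchCV` refits the best parameter combination on the whole training split by default (`refit=True`), the fitted search object can score the test split directly. The sketch below shows this on synthetic data with a reduced, illustrative grid over `C` only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=2)
grid_toy = {"C": [0.1, 1, 10]}  # illustrative grid, smaller than the one above

search = GridSearchCV(LogisticRegression(max_iter=1000), grid_toy, cv=kf_toy)
search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best setting (training split only).
best_cv_score = search.best_score_

# Accuracy of the refitted best model on the held-out test split.
test_acc = search.score(X_te, y_te)

print("best params:", search.best_params_)
print("best CV accuracy:", best_cv_score)
print("test accuracy:", test_acc)
```

Here the test split is used only once, after model selection has finished, so the reported test accuracy is not influenced by the parameter search.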

Check the parameter settings of the best selected model

In [24]:
gs_lr_result.best_params_

{'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

### Optimize the SVM model with the same steps
Parameters for the SVM model can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Fine-tune the parameters of the SVM model with a grid search on the training dataset

In [25]:
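# fine tune parameters for svm model
grid_params_svm = {
    'kernel': ['poly', 'rbf', 'sigmoid'],
    'C': list(range(1, 11)),
    'gamma': ['scale', 'auto'],
}

# a fresh SVC() is equivalent to `model` here, since GridSearchCV clones the estimator
gs_svm_result = GridSearchCV(SVC(), grid_params_svm, cv=kf).fit(X_train, y_train)
print(gs_svm_result.best_score_)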

0.715863032844165


Check the parameter settings of the best selected model

In [26]:
gs_svm_result.best_params_

{'C': 7, 'gamma': 'scale', 'kernel': 'poly'}
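
When preprocessing and model tuning are combined, one option is to wrap both in a scikit-learn `Pipeline`, so that the scaler is refitted inside every cross validation fold. The sketch below is an optional variation rather than the approach used in this lab. It assumes `StandardScaler` as the preprocessing step and runs on synthetic data with a reduced grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

# make_pipeline names each step after its lowercased class name ("svc" here).
pipe = make_pipeline(StandardScaler(), SVC())

# Pipeline parameters are addressed as <step name>__<parameter name>.
grid_toy = {
    "svc__kernel": ["rbf", "poly"],
    "svc__C": [1, 10],
}

kf_toy = KFold(n_splits=5, shuffle=True, random_state=2)
search = GridSearchCV(pipe, grid_toy, cv=kf_toy).fit(X_tr, y_tr)
test_acc = search.score(X_te, y_te)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
print("test accuracy:", test_acc)
```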