**CA05 Logistic Regression** - BSAN 6070

Sarah Olsen✌

Cardiovascular Disease (CVD) kills more people globally than cancer. This assignment uses a dataset of real heart patients collected from a 15-year heart study cohort. The dataset has 16 patient features, none of which includes any blood test information.

**PART 1**

*Build the Logistic Regression Model*

This model is a binary classifier that predicts CVD Risk (Yes/No, or 1/0) based on patient information.
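
As a rough illustration of how a logistic regression turns patient features into a Yes/No prediction, the sketch below computes a weighted sum of the inputs, passes it through the sigmoid function, and thresholds the resulting probability at 0.5. The weights, intercept, and two-feature input are hypothetical values chosen for illustration; they are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, intercept, threshold=0.5):
    """Return (probability, 0/1 label) for a single observation."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights for two made-up features (illustration only)
toy_weights = [0.04, 0.8]
toy_intercept = -3.0

prob, label = predict_label([65, 1], toy_weights, toy_intercept)
print('probability:', round(prob, 3), 'label:', label)
```

The 0.5 threshold matches the default behavior of `predict` for a binary scikit-learn classifier; a different threshold trades off false positives against false negatives.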

In [9]:
#Import some useful libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
#Read in the dataset from Github
data = pd.read_csv('https://github.com/ArinB/CA05-B-Logistic-Regression/raw/master/cvd_data.csv')

#Check the first few rows
data.head()

Unnamed: 0,cvd_4types,age_s1,race,educat,mstat,hip,neck20,waist,av_weight_kg,cgpkyr,tea15,srhype,parrptdiab,bend25,happy25,tired25,hlthlm25
0,0,54,1,2,1,110.0,40.0,108.0,87.5,34.0,0,1,0,1,2,3,4
1,0,56,3,2,1,113.0,34.0,107.0,83.5,0.0,0,0,0,2,2,1,3
2,0,54,1,3,1,110.0,44.5,105.0,86.2,49.5,0,0,0,3,2,6,4
3,0,54,1,3,1,129.0,42.5,110.0,89.1,0.0,0,0,0,3,2,1,3
4,0,51,3,2,1,122.0,37.0,113.0,81.3,0.0,0,0,0,2,1,1,2


In [3]:
#Check for nulls
data.isnull().sum()

cvd_4types      0
age_s1          0
race            0
educat          0
mstat           0
hip             0
neck20          0
waist           0
av_weight_kg    0
cgpkyr          0
tea15           0
srhype          0
parrptdiab      0
bend25          0
happy25         0
tired25         0
hlthlm25        0
dtype: int64

In [4]:
#Check the data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3242 entries, 0 to 3241
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cvd_4types    3242 non-null   int64  
 1   age_s1        3242 non-null   int64  
 2   race          3242 non-null   int64  
 3   educat        3242 non-null   int64  
 4   mstat         3242 non-null   int64  
 5   hip           3242 non-null   float64
 6   neck20        3242 non-null   float64
 7   waist         3242 non-null   float64
 8   av_weight_kg  3242 non-null   float64
 9   cgpkyr        3242 non-null   float64
 10  tea15         3242 non-null   int64  
 11  srhype        3242 non-null   int64  
 12  parrptdiab    3242 non-null   int64  
 13  bend25        3242 non-null   int64  
 14  happy25       3242 non-null   int64  
 15  tired25       3242 non-null   int64  
 16  hlthlm25      3242 non-null   int64  
dtypes: float64(5), int64(12)
memory usage: 430.7 KB


In [5]:
#Check the descriptive statistics
data.describe()

Unnamed: 0,cvd_4types,age_s1,race,educat,mstat,hip,neck20,waist,av_weight_kg,cgpkyr,tea15,srhype,parrptdiab,bend25,happy25,tired25,hlthlm25
count,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0,3242.0
mean,0.590068,64.828809,1.094695,2.326342,1.3686,105.404832,37.550719,97.209904,82.945928,12.90401,0.430907,0.327884,0.067551,2.473782,2.281308,4.292721,3.864898
std,0.491897,10.400496,0.358237,0.697934,0.933871,10.280402,4.101003,13.59806,7.84965,20.156736,1.242444,0.469515,0.251012,0.672158,0.951695,1.021099,0.614247
min,0.0,39.0,1.0,1.0,1.0,44.0,25.0,67.0,57.4,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,0.0,57.0,1.0,2.0,1.0,99.0,34.425,88.0,78.2,0.0,0.0,0.0,0.0,2.0,2.0,4.0,4.0
50%,1.0,65.0,1.0,2.0,1.0,104.0,37.15,97.0,82.55,0.3,0.0,0.0,0.0,3.0,2.0,4.0,4.0
75%,1.0,73.0,1.0,3.0,1.0,110.0,40.5,106.0,86.575,20.475,0.0,1.0,0.0,3.0,3.0,5.0,4.0
max,1.0,90.0,3.0,4.0,8.0,168.0,53.0,135.0,136.7,170.5,30.0,1.0,1.0,3.0,6.0,6.0,5.0


In [6]:
#Define the input variables and the target variable
#The input variables are the 16 different patient features
#The target variable is the cvd_4types

x = data[['age_s1', 'race', 'educat', 'mstat', 'hip', 'neck20', 'waist', 'av_weight_kg', 'cgpkyr', 'tea15', 'srhype', 'parrptdiab', 'bend25', 'happy25', 'tired25', 'hlthlm25']]
y = data['cvd_4types']

In [7]:
#Split the data into a test set and train set
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=101)

In [11]:
#Prepare the model
#Fit a logistic regression with default settings, then predict on the test set
model = LogisticRegression()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
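
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both, combined in a pipeline, is shown below. It runs on synthetic data from `make_classification` as a stand-in for the CVD dataset, so the names `X_demo`, `y_demo`, and `scaled_model` are assumptions for illustration and the printed numbers say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CVD data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=500, n_features=16, random_state=101)
X_demo[:, 0] *= 1000  # give one feature a much larger scale than the rest

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=101
)

# Standardize the features, then fit with a higher iteration limit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(Xd_train, yd_train)

n_iter = scaled_model.named_steps['logisticregression'].n_iter_[0]
demo_accuracy = scaled_model.score(Xd_test, yd_test)
print('iterations used:', n_iter)
print('test accuracy:', demo_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.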


**PART 2**

*Display the Feature Importance*

List all the features, sorted in order of decreasing influence on CVD risk. Influence is measured here by the absolute value of each feature's model coefficient.

In [12]:
#Rank the features by the absolute value of their fitted coefficients
feature_cvd = pd.DataFrame({'feature':list(x_train.columns), 'feature_cvd':[abs(i) for i in model.coef_[0]]})
feature_cvd.sort_values('feature_cvd', ascending = False)

Unnamed: 0,feature,feature_cvd
15,hlthlm25,0.547781
1,race,0.454128
2,educat,0.367839
12,bend25,0.171393
14,tired25,0.148969
11,parrptdiab,0.146222
10,srhype,0.083333
3,mstat,0.07685
6,waist,0.072238
4,hip,0.042297
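
One caveat with this ranking: raw coefficient magnitudes depend on each feature's units, so a feature measured in centimeters and a 0/1 indicator are not directly comparable. A common variation is to standardize the features before fitting and rank by the absolute standardized coefficients. The sketch below shows the idea on a small synthetic table; the column names `big_scale` and `small_scale` and all of the data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Two hypothetical features on very different scales
toy = pd.DataFrame({
    'big_scale': rng.normal(100, 10, n),
    'small_scale': rng.integers(0, 2, n).astype(float),
})

# Synthetic outcome that depends on both columns
logit = 0.05 * (toy['big_scale'] - 100) + 1.0 * (toy['small_scale'] - 0.5)
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so each coefficient is expressed per standard deviation
X_std = StandardScaler().fit_transform(toy)
std_model = LogisticRegression(max_iter=1000).fit(X_std, toy_y)

importance = pd.DataFrame({
    'feature': toy.columns,
    'abs_std_coef': np.abs(std_model.coef_[0]),
}).sort_values('abs_std_coef', ascending=False)
print(importance)
```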


**PART 3**

*Evaluating the Model*

In [None]:
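from sklearn.metrics import roc_auc_score, f1_score

#Calculate the AUC Score from the predicted probability of the positive class
y_prob = model.predict_proba(x_test)[:, 1]
auc_score = roc_auc_score(y_test, y_prob)
#Calculate the F1 Score from the predicted class labels
F1 = f1_score(y_test, y_pred)

print('auc:', auc_score)
print('F1:', F1)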
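
To make the two metrics concrete, here is a small by-hand sketch on toy labels. F1 is the harmonic mean of precision and recall on the predicted classes, and AUC is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The helper functions and toy values are illustrative assumptions, not scikit-learn's implementation.

```python
def f1_from_labels(y_true, y_hat):
    """F1 score for binary 0/1 labels, computed from confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_from_scores(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data (illustration only)
toy_true = [1, 1, 1, 0, 0, 0]
toy_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_pred = [int(p >= 0.5) for p in toy_prob]

print('F1:', f1_from_labels(toy_true, toy_pred))
print('AUC from probabilities:', auc_from_scores(toy_true, toy_prob))
print('AUC from hard labels:', auc_from_scores(toy_true, toy_pred))
```

On this toy data the AUC computed from hard labels is lower than the AUC computed from probabilities, because thresholding throws away the ranking information within each predicted class.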