# Logistic regression for cardiovascular disease detection
Using logistic regression to detect cardiovascular disease (CVD) gives a model with 72% accuracy on the test set (F1 scores of 0.71 and 0.73 for the two classes), which is quite good!

This is still something of a draft, as I'm waiting to hear more about the codebook from the dataset uploader. I don't want to make any false assumptions in my interpretation.

Any feedback/questions are welcome.


## Step 1: EDA
Key things I'm looking for:
- data types: we have a mix of categorical (incl. binary) and continuous variables, so we need to keep that in mind when preprocessing the data
- class imbalance: the data is approximately balanced, so we won't need to worry about balancing the dataset
- outliers: for the continuous variables, we want to keep an eye on whether there are outliers
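
As a rough illustration of the last two checks, the sketch below computes class proportions and flags values outside 1.5 × IQR. The tiny `toy` frame and the helper name `iqr_outlier_mask` are hypothetical stand-ins, not part of this notebook, and the 1.5 × IQR rule is just one common convention.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "ap_hi": [120, 130, 110, 125, 115, 900],  # 900 is an implausible reading
    "cardio": [0, 1, 0, 1, 0, 1],
})

# Class balance: share of rows per target value.
class_share = toy["cardio"].value_counts(normalize=True)


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outliers = toy.loc[iqr_outlier_mask(toy["ap_hi"]), "ap_hi"]
print(class_share)
print(outliers.tolist())
```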

In [None]:
import pandas as pd
import numpy as np


In [None]:
data = pd.read_csv('/kaggle/input/cardio-vascular-disease-detection/cardio_train.csv', delimiter=';')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isna().sum()

Here, I create some data visualizations to get a better sense of the data and of the next steps:
- Loop through the categorical variables and create "incidence" charts showing the incidence (%) of cardiovascular disease within each category. This tells me that certain categories do carry higher risk.
- Loop through the continuous variables to get a sense of their distribution for each value of the target. This suggests outliers are present in most of the continuous variables.
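
For reference, incidence within a category is the share of rows in that category with `cardio == 1`, which is simply the group mean of the target. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 4 non-smokers (1 case) and 2 smokers (2 cases).
toy = pd.DataFrame({
    "smoke":  [0, 0, 0, 0, 1, 1],
    "cardio": [0, 1, 0, 0, 1, 1],
})

grouped = toy.groupby("smoke")["cardio"]
summary = pd.DataFrame({
    "n": grouped.size(),                    # rows per category (the bars)
    "incidence_pct": grouped.mean() * 100,  # % of rows with cardio == 1 (the line)
})
print(summary)
```

Dividing by the category size, rather than by the size of the whole dataset, keeps the percentage comparable between large and small categories.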

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
def plot_incidence(feature):
    # sort the categories so the bars appear in a stable order
    cats = sorted(data[feature].unique())
    
    xs = range(0, len(cats))
    ys_bar = []
    ys_line = []
    
    for cat in cats:
        in_cat = data[data[feature] == cat]
        ys_bar.append(in_cat.shape[0])
        # incidence = share of rows within this category that have cardio == 1
        ys_line.append(in_cat.cardio.mean() * 100)
    
    fig, ax = plt.subplots()
    
    ax.bar(xs, ys_bar, color='grey')

    # second y-axis so the incidence line is not squashed by the counts
    ax2 = ax.twinx()
    ax2.plot(xs, ys_line, color='teal')
    
    ax.set_xticks(xs)
    ax.set_xticklabels(cats, rotation=90)
    ax.set_xlabel(feature)
    
    ax.set_ylabel('Frequency (n)')
    ax2.set_ylabel('Incidence (%)')
    
    fig.suptitle(f"Cardio incidence by {feature}")
    
    plt.show()

In [None]:
cont_vars = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']

# first I transform the age variable from days to years
data['age'] = data['age'] / 365

cat_vars = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']


In [None]:
for var in cat_vars:
    plot_incidence(var)

In [None]:
for var in cont_vars:
    _ = sns.boxplot(x='cardio', y=var, data=data)
    plt.show()

## Step 2: Data Processing
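One of the features engineered in this step is BMI. As a reference sketch, BMI is weight in kilograms divided by the square of height in metres. The code below assumes height is recorded in centimetres and weight in kilograms, and bins the result with the usual 18.5 / 25 / 30 cut points. The `toy` frame and the category labels are illustrative, not taken from the dataset's codebook.

```python
import pandas as pd

# Hypothetical toy data: height in cm, weight in kg (assumed units).
toy = pd.DataFrame({
    "height": [170, 160, 180, 165],
    "weight": [50.0, 60.0, 90.0, 95.0],
})

# Convert height to metres before squaring.
toy["bmi"] = (toy["weight"] / (toy["height"] / 100) ** 2).round(2)

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]
toy["bmi_cat"] = pd.cut(toy["bmi"], bins=bins, labels=labels, right=False)
print(toy)
```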

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

In [None]:
# I've decided to engineer a BMI feature and remove height and weight:
# height and weight on their own aren't as informative as their ratio to one another.
# BMI = weight (kg) / height (m)^2; height is assumed to be recorded in cm.
# I then bin the BMI into the standard categories.
data['bmi'] = (data.weight / (data.height / 100) ** 2).round(2)
data['bmi_cat'] = pd.cut(data.bmi, pd.IntervalIndex.from_tuples([(0, 18.5), (18.5, 25), (25, 30), (30, 1000)]))
cat_vars.append('bmi_cat')

cont_vars.remove('height')
cont_vars.remove('weight')

# I also want to make age a categorical variable (deciles of age in whole years)
data['age'] = data['age'].round()
data['age_cat'] = pd.qcut(data.age, q=10, duplicates='drop', labels=list(range(10)))

cat_vars.append('age_cat')
cont_vars.remove('age')

# list for my final variables
final_vars = []
final_vars.extend(cont_vars)
final_vars.extend(cat_vars)

In [None]:
# split into train and test set
x_train, x_test, y_train, y_test = train_test_split(data[final_vars], data.cardio, test_size=.1, random_state=42)

In [None]:
# setting up a pipeline to transform categorical variables to one hot encoded variables and to scale continuous variables
ct = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), cat_vars), ('scaler', StandardScaler(), cont_vars)])

# note that we only fit it using data that will be used to fit the model: x_train
ct.fit(x_train)

In [None]:
# transforming x_train and x_test according to pipeline
x_train = ct.transform(x_train)
x_test = ct.transform(x_test)

## Step 3: Training the Model
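One way to make the "fit on training data only" rule automatic is to wrap the preprocessing and the classifier in a single `Pipeline`, so that `fit` never sees the test rows. This is a sketch on synthetic data, not this notebook's actual setup: the column names, the random data, and the use of plain `LogisticRegression` instead of `LogisticRegressionCV` are assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "ap_hi": rng.normal(125, 15, n),
    "smoke": rng.integers(0, 2, n),
})
# Target loosely tied to ap_hi so the model has something to learn.
toy["cardio"] = (toy["ap_hi"] + rng.normal(0, 10, n) > 125).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy[["ap_hi", "smoke"]], toy["cardio"], test_size=0.25, random_state=42
)

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["smoke"]),
    ("scaler", StandardScaler(), ["ap_hi"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])

# Scaler statistics and encoder categories are learned from x_tr only.
model.fit(x_tr, y_tr)
test_accuracy = model.score(x_te, y_te)
print(f"test accuracy: {test_accuracy:.2f}")
```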

In [None]:
import time

lr = LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear', refit=True, random_state=42)

start = time.time()
lr.fit(x_train, y_train.values)
end = time.time()
print(f"logistic regression fit in {(end - start) / 60:.2f} mins")

## Step 4: Evaluating the Model
Based on the classification report, the model performs reasonably well and does not appear to overfit, as its performance on the train and test sets is quite similar.
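
One detail worth keeping in mind: `classification_report` takes the true labels first and the predictions second, and swapping them swaps precision and recall for each class. A small sketch with made-up labels:

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

report = classification_report(y_true, y_pred, output_dict=True)
swapped = classification_report(y_pred, y_true, output_dict=True)

# For class 1: two true positives, one false positive, no false negatives.
print(report["1"]["precision"], report["1"]["recall"])
print(swapped["1"]["precision"], swapped["1"]["recall"])
```

Accuracy and the F1 scores are unaffected by the swap, but the per-class precision and recall columns trade places.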

In [None]:
from sklearn.metrics import classification_report

# classification_report expects (y_true, y_pred), in that order
for name, (features, labels) in {'TRAIN': (x_train, y_train), 'TEST': (x_test, y_test)}.items():
    preds = lr.predict(features)
    print(f"{name} RESULTS\n\n{classification_report(labels, preds)}\n\n")

## Step 5: Interpreting the Model
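The columns coming out of a `ColumnTransformer` follow the order of its transformers, and `OneHotEncoder` orders each feature's categories by sorted value, so coefficient labels need to be built in that same order. A sketch on toy data (the frame and the names `ct_toy`, `cat_cols`, `cont_cols` are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data.
toy = pd.DataFrame({
    "ap_hi": [120.0, 140.0, 110.0, 150.0],
    "gluc": [2, 1, 3, 1],
})
cat_cols, cont_cols = ["gluc"], ["ap_hi"]

ct_toy = ColumnTransformer([
    ("onehot", OneHotEncoder(), cat_cols),    # output columns 0..2
    ("scaler", StandardScaler(), cont_cols),  # output column 3
])
transformed = ct_toy.fit_transform(toy)

# Rebuild the labels in the transformer's output order.
names = []
for col, values in zip(cat_cols, ct_toy.named_transformers_["onehot"].categories_):
    names.extend(f"{col}_{v}" for v in values)
names.extend(cont_cols)
print(names)
```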

In [None]:
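# Build the feature names in the same order as the ColumnTransformer output:
# one-hot columns first (categories in the encoder's own order), then the scaled continuous columns
onehot = ct.named_transformers_['onehot']

feature_names = []
for cat, values in zip(cat_vars, onehot.categories_):
    for val in values:
        feature_names.append(f"{cat}_{val}")

feature_names.extend(cont_vars)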

In [None]:
feature_coefs = {feature: coefficient for feature, coefficient in zip(feature_names, lr.coef_[0])} 

In [None]:
feature_df = pd.Series(feature_coefs).to_frame()
feature_df = feature_df.reset_index()

# the coefficients are log-odds; exponentiating them gives odds ratios
feature_df.rename(columns={'index': 'feature', 0: 'log_odds'}, inplace=True)

feature_df['odds'] = feature_df.log_odds.apply(np.exp)

In [None]:
feature_df

In [None]:
# manually capping the odds ratios at 4 for plotting, because the largest one was stretching the plot too much
feature_df['odds'] = feature_df['odds'].clip(upper=4)

In [None]:
# plotting
ys = list(range(len(feature_df)))
xs = feature_df.odds.values
cs = []

# blue: lowers the odds, grey: no effect, orange: raises the odds
for x in xs:
    if x < 1:
        cs.append('blue')
    elif x == 1:
        cs.append('grey')
    else:
        cs.append('orange')

fig = plt.figure(figsize=(8, 8))
_ = plt.scatter(xs, ys, s=30, color=cs)

plt.yticks(ticks=ys, labels=feature_df.feature.values)
plt.xlabel('Odds Ratio')
plt.ylabel('Feature')
plt.title('Odds Ratios for Features')
plt.show()

My preliminary interpretation is:
- risk increases with age, with particularly high risk at ages 61-62 (age_cat_8)
- risk is also higher for gender 2 (I assume this is male)
- risk is highest for the underweight BMI category, interestingly (BMI 0-18.5)
- risk is highest with above-normal glucose readings (interestingly, not with the well-above-normal readings)
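
For reference, a logistic regression coefficient is a change in log-odds, and exponentiating it gives an odds ratio: values above 1 raise the odds of the positive class, values below 1 lower them, and exactly 1 means no effect. For a standardised continuous variable the ratio is per one standard deviation. The coefficients below are invented for illustration and are not this model's.

```python
import numpy as np

# Invented log-odds coefficients, for illustration only.
coefs = np.array([-0.7, 0.0, 0.4])

odds_ratios = np.exp(coefs)
# Percentage change in the odds associated with each coefficient.
pct_change_in_odds = (odds_ratios - 1) * 100

for c, o, p in zip(coefs, odds_ratios, pct_change_in_odds):
    print(f"coef={c:+.1f}  odds ratio={o:.2f}  change in odds={p:+.0f}%")
```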