> ## *Please upvote if you like my efforts and provide valuable suggestions!*


**About the dataset:** column descriptions

**Demographic**
* Sex: male or female (Nominal)
* Age: age of the patient (Continuous: the recorded ages are truncated to whole numbers, but age itself is a continuous quantity)

**Behavioral**
* Current Smoker: whether or not the patient is a current smoker (Nominal)
* Cigs Per Day: the number of cigarettes the person smoked on average in one day (can be treated as continuous, since one can smoke any number of cigarettes, even half a cigarette)

**Medical (history)**
* BP Meds: whether or not the patient was on blood pressure medication (Nominal)
* Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
* Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
* Diabetes: whether or not the patient had diabetes (Nominal)

**Medical (current)**
* Tot Chol: total cholesterol level (Continuous)
* Sys BP: systolic blood pressure (Continuous)
* Dia BP: diastolic blood pressure (Continuous)
* BMI: Body Mass Index (Continuous)
* Heart Rate: heart rate (Continuous: in medical research, variables such as heart rate are in fact discrete, but are treated as continuous because of the large number of possible values)
* Glucose: glucose level (Continuous)

**Predict variable (desired target)**
* 10-year risk of coronary heart disease, CHD (binary: “1” means “Yes”, “0” means “No”)

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings to keep the output readable
# Importing data
data = pd.read_csv('../input/heart-disease-prediction-using-logistic-regression/framingham.csv')
print("Let's see the first 5 rows of the dataset.")
data.head()

# EDA

I will use the **pandas_profiling** library to understand the data.
It is a nice alternative to the .info and .describe methods. In fact, it gives much more useful information, such as the percentage of missing values, the mean, maximum and minimum of each column, and a heatmap depicting correlations.
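For comparison, a few of the numbers the profile report shows can also be computed by hand with plain pandas. The following is only a minimal sketch on a made-up toy frame (the `toy` values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "glucose": [77.0, np.nan, 70.0, 103.0, np.nan],
    "BPMeds": [0.0, 0.0, np.nan, 0.0, 0.0],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# A few summary statistics per column (missing values are skipped).
summary = toy.agg(["mean", "min", "max"])

print(missing_pct.round(1))
print(summary)
```

The same two expressions should work on `data` itself, assuming it has been loaded as in the cell above.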

In [None]:
import pandas_profiling
data.profile_report()

**Conclusions-**
- Our data contains 4238 rows and 16 columns.
- The data seems to be preprocessed already: there are no categorical columns, because those features have already been encoded.

Now, let's look at the missing values.
- Columns with missing values are: 'education' (2.5%), 'cigsPerDay' (0.7%), 'BPMeds' (1.3%), 'totChol' (1.2%) and 'glucose' (**9.2%**).

Except for glucose, every column is missing less than 3% of its values. We could simply drop those rows, but in this project I choose to use SimpleImputer to fill them with the most frequent value.
As for glucose, notice in the heatmap that it is highly correlated with diabetes. So, I will use the diabetes feature to fill the missing values in glucose.
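The idea of filling glucose from diabetes can be sketched with a group-wise mean. This is a hedged illustration on made-up data, not the exact code used later in the notebook; the `toy` frame and the `glucose_filled` column name are assumptions introduced only for this example:

```python
import numpy as np
import pandas as pd

# Made-up data: two diabetes groups, each with one missing glucose value.
toy = pd.DataFrame({
    "diabetes": [0, 0, 0, 1, 1, 1],
    "glucose": [80.0, 78.0, np.nan, 160.0, 180.0, np.nan],
})

# Mean glucose within each diabetes group, aligned row by row.
group_mean = toy.groupby("diabetes")["glucose"].transform("mean")

# Fill only the missing entries with the mean of their own group.
toy["glucose_filled"] = toy["glucose"].fillna(group_mean)

print(toy)
```

Each missing value is replaced by the mean of the patients who share the same diabetes status, so a diabetic patient is not assigned a typical non-diabetic glucose level.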

In [None]:
sns.set(style="whitegrid",palette='Set2')

In [None]:
print("Distribution of boolean variables")
print(' “1” means “Yes”, “0” means “No”')
fig,axes = plt.subplots(nrows=2,ncols=3,figsize=(12,8))
# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=data.TenYearCHD,ax=axes[0,0])
sns.countplot(x=data.male,ax=axes[0,1])
axes[0,1].set_xlabel("0 is female and 1 is male")
sns.countplot(x=data.currentSmoker,ax=axes[0,2])
sns.countplot(x=data.BPMeds,ax=axes[1,0])
sns.countplot(x=data.prevalentStroke,ax=axes[1,1])
sns.countplot(x=data.prevalentHyp,ax=axes[1,2])
plt.tight_layout()

In [None]:
sns.set(style="darkgrid",palette='Set1')
print("Distribution of continuous variables")
fig,axes = plt.subplots(nrows=4,ncols=2,figsize=(12,8))
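# distplot is deprecated in seaborn; histplot with kde=True draws a histogram plus density curve
sns.histplot(data.age,kde=True,ax=axes[0,0])
sns.histplot(data.BMI,kde=True,ax=axes[0,1])
sns.histplot(data.glucose,kde=True,ax=axes[1,0])
sns.histplot(data.cigsPerDay,kde=True,ax=axes[1,1])
sns.histplot(data.sysBP,kde=True,ax=axes[2,0])
sns.histplot(data.diaBP,kde=True,ax=axes[2,1])
sns.histplot(data.totChol,kde=True,ax=axes[3,0])
sns.histplot(data.heartRate,kde=True,ax=axes[3,1])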
plt.tight_layout()

In [None]:
# Plotting a line graph to check the relationship between age and the mean of cigsPerDay, totChol and glucose.
graph_3 = data.groupby("age").cigsPerDay.mean()
graph_4 = data.groupby("age").totChol.mean()
graph_5 = data.groupby("age").glucose.mean()

plt.figure(figsize=(10,6))
sns.lineplot(data=graph_3, label="cigsPerDay")
sns.lineplot(data=graph_4, label="totChol")
sns.lineplot(data=graph_5, label="glucose")
plt.title("Mean cigsPerDay, totChol and glucose at every age.",{'fontsize':18})
plt.xlabel("age", size=20)
plt.ylabel("mean value", size=20)  # the lines show per-age means, not counts
plt.xticks(size=12)
plt.yticks(size=12);

In [None]:
graph = data.groupby("age",as_index=False).currentSmoker.sum()
plt.figure(figsize=(10,6))
sns.barplot(x=graph["age"], y=graph["currentSmoker"])
plt.title("Graph showing which age group has more smokers.",{'fontsize':18});

# Handling missing values

In [None]:
# Let's have a visual look at missing data
msno.matrix(data);

In [None]:
# Mean glucose for non-diabetic (0) and diabetic (1) patients
data.groupby('diabetes')['glucose'].mean()

In [None]:
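def impute_glucose(row):
    # Fill a missing glucose value according to the patient's diabetes status.
    # 79 (non-diabetic) and 170 (diabetic) are based on the group means computed above.
    # Columns are accessed by label, since positional access like row[0] is deprecated in pandas.
    if pd.isnull(row['glucose']):
        return 79 if row['diabetes'] == 0 else 170
    return row['glucose']

data['glucose'] = data[['diabetes','glucose']].apply(impute_glucose,axis=1)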

In [None]:
#Another way to visualize missing data
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='summer');

So, glucose feature has no missing data now.
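The remaining columns are filled with their most frequent value. As a minimal sketch of how `SimpleImputer(strategy='most_frequent')` behaves, here it is applied to a small made-up frame (the `toy` values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing value per column.
toy = pd.DataFrame({
    "education": [1.0, 2.0, 2.0, np.nan, 4.0],
    "BPMeds": [0.0, 0.0, np.nan, 1.0, 0.0],
})

# Each column's missing entries are replaced by that column's mode.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Here the missing `education` entry becomes 2.0 and the missing `BPMeds` entry becomes 0.0, the most common value in each column.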

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(data)
imputed_data = imputer.transform(data)
imputed_data = pd.DataFrame(imputed_data,columns=data.columns)

In [None]:
print("just to cross-check all missing data is gone!")
msno.bar(imputed_data);

# Creating Model

In [None]:
#Libraries needed for model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import cross_val_score

In [None]:
#First split the data
X_train, X_test, y_train, y_test = train_test_split(imputed_data.drop('TenYearCHD',axis=1), 
                                                    imputed_data['TenYearCHD'], test_size=0.30, 
                                                    random_state=101)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))

In [None]:
print(f'Model accuracy score: {accuracy_score(y_test, predictions):0.4f}')

**Cross-validation** is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. In k-fold cross-validation, you split the input data into k subsets of data (also known as folds). You train an ML model on all but one (k-1) of the subsets, and then evaluate the model on the subset that was not used for training. This process is repeated k times, with a different subset reserved for evaluation (and excluded from training) each time.
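The procedure described above can be written out as an explicit loop. This is only a sketch on synthetic data from `make_classification` (an assumption for illustration, not the heart disease dataset), using a plain `KFold` split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on k-1 folds, evaluate on the fold that was held out.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

mean_score = float(np.mean(fold_scores))
print(f"Fold accuracies: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {mean_score:.3f}")
```

`cross_val_score` wraps this loop in one call. Note that with an integer `cv` and a classifier it uses stratified folds, so its splits differ slightly from the plain `KFold` used in this sketch.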

In [None]:
score=cross_val_score(LogisticRegression(),imputed_data.drop('TenYearCHD',axis=1),imputed_data['TenYearCHD'],cv=10)
print(f"After k-fold cross validation score is {score.mean()}")

## Using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV
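parameters = [{'penalty':['l1','l2']}, 
              {'C':[1, 10, 100, 1000]}]
# The default 'lbfgs' solver does not support the l1 penalty, so use 'liblinear', which supports both l1 and l2
grid_search = GridSearchCV(estimator = LogisticRegression(solver='liblinear'),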
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 5,
                           verbose=0)
grid_search.fit(X_train, y_train)
# best score achieved during the GridSearchCV
print('GridSearch CV best score : {:.4f}\n'.format(grid_search.best_score_))
# print parameters that give the best results
print(f'Parameters that give the best results : {grid_search.best_params_}')

In [None]:
score2=cross_val_score(grid_search,imputed_data.drop('TenYearCHD',axis=1),imputed_data['TenYearCHD'],cv=10)
print(f"After k-fold cross validation score is {score2.mean()}")

# CONCLUSION:

- After cross-validation, the logistic regression model reaches an accuracy of about 84.9%, which is a good result.
- Using GridSearchCV does not improve accuracy on this particular data.