## Supervised Learning
### Logistic Regression

## Heart Attack Possibility
### Problem Statement:
With the given data, you are required to identify the key hidden patterns associated with heart attacks and use that information to build a predictive model which can identify and predict the possibility of getting a heart attack.

### Solution:
Build a predictive model based on Logistic Regression which can identify the patients who are likely to have a heart attack and also predict the probability of getting a heart attack.

### Approach:
- EDA: Exploratory Data Analysis.
- Preparing the data for modeling.
- Training the model.
- Model Evaluation.
- Prediction on test data.

In [None]:
# Ignoring warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing relevant libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import sklearn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics

In [None]:
# Setting the visual preference
plt.style.use('dark_background')

## Task 1: EDA - Exploratory Data Analysis
- ### Subtask 1.1: Read and understand the data.

In [None]:
df = pd.read_csv('../input/health-care-data-set-on-heart-attack-possibility/heart.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.duplicated().sum()

In [None]:
round(df.isnull().sum()/len(df.index)*100,4)

##### Explanation: 
The data set has a total of 303 rows and 14 columns: 13 independent variables and 1 dependent variable, all numerical. There are no missing values, and there is only 1 duplicate row, which can be removed.

- ### Subtask 1.2: Assigning proper column names

In [None]:
df.rename(columns = {'cp': 'chest pain', 'trestbps': 'resting BP', 'chol': 'cholestoral', 'fbs': 'fasting Blood sugar',
                    'restecg': 'resting ECG', 'thalach': 'maximum heart rate', 'exang': 'exercise induced angina',
                    'oldpeak': 'ST depression', 'ca': 'no.of major vessels blocked', 'thal': 'defect'}, inplace = True)

In [None]:
df.head()

- ### Subtask 1.3: Outliers

In [None]:
var = df.drop('target', axis = 1).columns

In [None]:
plt.figure(figsize = (15,15))
# One boxplot per feature on a 5x3 grid; the column is passed by keyword (x=).
for i, name in enumerate(var):
    plt.subplot(5,3,i+1)
    sns.boxplot(x = name, data = df, palette = 'Purples')
plt.show()

In [None]:
outliers = ['resting BP', 'cholestoral', 'maximum heart rate', 'ST depression', 
            'no.of major vessels blocked', 'defect']

In [None]:
plt.figure(figsize = (15, 10))
# A closer look at the features that showed outliers above.
for i, name in enumerate(outliers):
    plt.subplot(2,3,i+1)
    sns.boxplot(x = name, data = df, palette = 'Purples')
plt.show()

In [None]:
df['resting BP'].quantile([0.25,0.50,0.75,0.90,0.95,0.99])

In [None]:
df.loc[df['resting BP']> 160, ['resting BP']] = 160

##### Explanation:
There are not many outliers in the dataset; the few outlying data points in most variables can be ignored. Only the variable 'resting BP' had around 4% of its values as outliers, so these were treated by capping them at the 95th percentile (160).

- ### Subtask 1.4: Data visualization

In [None]:
sns.countplot(x = 'sex', data = df, palette = 'Purples')
plt.show()

##### Explanation:
The data set is imbalanced by gender: only 96 observations are for female patients, against 207 for male patients.

In [None]:
bins = [0,10,20,30,40,50,60,70,80]
# pd.cut uses right-closed intervals by default, e.g. (10, 20] covers ages 11-20.
labels = ['<=10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
df['age_group'] = pd.cut(x = df['age'], bins = bins, labels = labels)

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x = df['age_group'], hue= df['target'], palette= 'Purples')
plt.show()

##### Explanation:
In this data set, the 41-50 age group has the highest proportion of heart attack patients of all the age groups: the number of heart attack patients is about twice the number of non heart attack patients in that group.
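
A count plot shows absolute numbers, so the ratio within each age group has to be read off by eye. A row-normalised cross-tabulation gives the proportions directly. This is a small sketch on invented data (the frame `toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Invented example rows; in the notebook these would be df['age_group'] and df['target'].
toy = pd.DataFrame({
    'age_group': ['41-50', '41-50', '41-50', '51-60', '51-60', '51-60', '51-60', '61-70'],
    'target':    [1, 1, 0, 1, 0, 0, 1, 0],
})

# normalize='index' divides each row by its row total, so every row sums to 1
# and column 1 holds the share of target == 1 within that age group.
rates = pd.crosstab(toy['age_group'], toy['target'], normalize='index')
print(rates)
```

Applied to the notebook's `df`, the column for `target == 1` would show which age group has the highest share of heart attack patients.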

In [None]:
one = df.loc[df['target'] == 1]
zero = df.loc[df['target'] == 0]

In [None]:
var = ['resting BP', 'cholestoral', 'maximum heart rate', 'ST depression']
plt.figure(figsize = (14,7))
for i, name in enumerate(var):
    plt.subplot(2,2,i+1)
    # Red: target == 1, cyan: target == 0. `fill` replaces the deprecated `shade` argument.
    sns.kdeplot(x = one[name], fill = True, color = 'r')
    sns.kdeplot(x = zero[name], fill = True, color = 'c')
plt.show()

##### Explanation:
- The red line represents heart attack patients and the cyan line represents non heart attack patients.
- The maximum heart rate tends to be higher for the heart attack patients.
- A desirable total cholesterol level for adults is generally below 200 mg/dL. However, as per the above distribution, most patients in both groups have a higher cholesterol level.
- A normal resting systolic BP is below 120 mmHg. However, the majority of both groups have a BP in the range of 120 - 130.

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x = 'chest pain', hue = 'target', data = df, palette = 'Purples')
plt.show()

##### Explanation:
Patients with type 2 chest pain are the most likely to be in the heart attack group.

In [None]:
# Calculating the imbalance percentage.
label = ['Heart Attack', 'Non-Heart Attack']
explode = [0.1,0]
df['target'].value_counts().plot.pie(explode = explode, labels = label, shadow = True, startangle=60, 
                                      autopct='%1.1f%%', textprops = {'color' : 'k'})
plt.show()

## Task 2: Preparing the data for modeling
- ### Subtask 2.1: Dropping unnecessary columns

In [None]:
df.drop('age_group', axis = 1, inplace = True)

In [None]:
df.drop_duplicates(inplace = True)

- ### Subtask 2.2: Splitting the data into train-test and rescaling of variables

In [None]:
# Splitting the data into train and test.
df_train, df_test = train_test_split(df, train_size = 0.70, random_state = 100)
print(df_train.shape)
print(df_test.shape)

In [None]:
# Rescaling of variables.
var = ['age', 'chest pain', 'resting BP', 'cholestoral', 'resting ECG', 'maximum heart rate', 
       'ST depression', 'slope', 'no.of major vessels blocked', 'defect']

In [None]:
scaler = MinMaxScaler()
df_train[var] = scaler.fit_transform(df_train[var])
df_train.describe()

##### Explanation:
Variable transformation is a vital step before building most ML models.
- It speeds up computation (convergence happens more quickly).
- If all the variables are on the same scale, it is easier to interpret and compare the results from the model.

- ### Subtask 2.3: Correlation and Heatmap

In [None]:
# Calculating the correlation between variables.
df_train.corr()

In [None]:
# Heatmap
plt.figure(figsize = (15,10))
sns.heatmap(df_train.corr(), annot = True, cmap = 'Purples')
plt.show()

## Task 3: Training the model
- ### Subtask 3.1: Assigning X and Y

In [None]:
Y_train = df_train.pop('target')
X_train = df_train

- ### Subtask 3.2: RFE- Recursive Feature Elimination

In [None]:
log_reg = LogisticRegression()
rfe = RFE(log_reg, n_features_to_select = 10)
rfe_model = rfe.fit(X_train, Y_train)

In [None]:
pd.DataFrame(zip(X_train.columns, rfe_model.ranking_)).sort_values(by = 1, ascending = True)

In [None]:
col = X_train.columns[rfe_model.support_]
col

- ### Subtask 3.3: Stats Models
  - Note: Assuming a significance level (alpha) of 0.05, i.e. a 95% confidence level.
    - H0: The coefficient of the variable is zero (the variable is not significant in determining Heart Attack).
    - H1: The coefficient of the variable is non-zero (the variable is significant in determining Heart Attack).
    - H0 is rejected when the p-value is < 0.05.
  - Note: Assuming the permissible VIF level to be < 5.
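
As background for the VIF rule, the VIF of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the other features. The sketch below computes it with plain NumPy on synthetic data (the function name `vif_by_regression` and the data are assumptions for illustration). It adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the numbers may differ from those in the notebook:

```python
import numpy as np

def vif_by_regression(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    # With an intercept the residuals have zero mean, so this ratio is SSE / SST.
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)   # nearly a copy of `a`
X = np.column_stack([a, b, c])

vifs = [vif_by_regression(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

The two nearly identical columns get a VIF far above 5, while the independent column stays close to 1.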

In [None]:
# Model_1
X_train_sm = sm.add_constant(X_train[col])
log_model = sm.GLM(Y_train, X_train_sm, family = sm.families.Binomial()).fit()
print(log_model.summary())

In [None]:
vif = pd.DataFrame()
vif['Features'] = col
vif['VIF'] = [variance_inflation_factor(X_train[col].values, x) for x in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif

In [None]:
X_2 = X_train[col].drop('cholestoral', axis = 1)

In [None]:
X_2_sm = sm.add_constant(X_2)
log_model_2 = sm.GLM(Y_train, X_2_sm, family = sm.families.Binomial()).fit()
print(log_model_2.summary())

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_2.columns
vif['VIF'] = [variance_inflation_factor(X_2.values, x) for x in range(X_2.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif

In [None]:
X_3 = X_2.drop('slope', axis = 1)
X_3_sm = sm.add_constant(X_3)
log_model_3 = sm.GLM(Y_train, X_3_sm, family = sm.families.Binomial()).fit()
print(log_model_3.summary())

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_3.columns
vif['VIF'] = [variance_inflation_factor(X_3.values, x) for x in range(X_3.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif

In [None]:
X_4 = X_3.drop('exercise induced angina', axis = 1)
X_4_sm = sm.add_constant(X_4)
log_model_4 = sm.GLM(Y_train, X_4_sm, family = sm.families.Binomial()).fit()
print(log_model_4.summary())

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_4.columns
vif['VIF'] = [variance_inflation_factor(X_4.values, x) for x in range(X_4.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif

In [None]:
X_5 = X_4.drop('age', axis = 1)
X_5_sm = sm.add_constant(X_5)
log_model_5 = sm.GLM(Y_train, X_5_sm, family = sm.families.Binomial()).fit()
print(log_model_5.summary())

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_5.columns
vif['VIF'] = [variance_inflation_factor(X_5.values, x) for x in range(X_5.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif

In [None]:
X_6 = X_5.drop('defect', axis = 1)
X_6_sm = sm.add_constant(X_6)
log_model_6 = sm.GLM(Y_train, X_6_sm, family = sm.families.Binomial()).fit()
print(log_model_6.summary())

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_6.columns
vif['VIF'] = [variance_inflation_factor(X_6.values, x) for x in range(X_6.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif

##### Explanation:
log_model_6 is the final model. The p-value for every feature is below 0.05, which makes all the features significant in the model. The VIF scores for these variables are also below 5, which indicates that there is no serious multicollinearity between them.

## Task 4: Model Evaluation
- ### Subtask 4.1: Finding the Optimal Threshold

In [None]:
# Predicting using the log_model_6
Y_train_pred = log_model_6.predict(X_6_sm)

In [None]:
# Conversion at different probability
cutoff = pd.DataFrame()
cutoff['Actual'] = Y_train.values
cutoff['Pred'] = Y_train_pred.values
num = [float(x/10) for x in range(10)]
for x in num:
    cutoff[x] = cutoff['Pred'].map(lambda i: 1 if i > x else 0)
cutoff.head()

In [None]:
# Calculating various measures.
measures = pd.DataFrame(columns = ['Probability', 'Accuracy', 'Sensitivity', 'FPR', 'Specificity', 'FNR'])
for x in num:
    metrix = metrics.confusion_matrix(cutoff['Actual'], cutoff[x])
    total = sum(sum(metrix))
    Accuracy = (metrix[0,0]+metrix[1,1])/total
    Sensitivity = metrix[1,1]/(metrix[1,1]+metrix[1,0])
    FPR = metrix[0,1]/(metrix[0,1]+metrix[0,0])
    Specificity = metrix[0,0]/(metrix[0,0]+metrix[0,1])
    FNR = metrix[1,0]/(metrix[1,0]+metrix[1,1])
    measures.loc[x] = [x, Accuracy, Sensitivity, FPR, Specificity, FNR]

In [None]:
measures

In [None]:
# Plotting the lines to find the optimal Threshold limit.
measures.plot.line(x = 'Probability', y = ['Accuracy', 'Sensitivity', 'Specificity'])
plt.show()

##### Explanation
The optimal threshold is the point where 'Accuracy', 'Sensitivity' and 'Specificity' are all fairly high and almost equal. It is usually the intersection point on the graph. Hence the optimal threshold is 0.6.
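
The intersection point can also be located numerically instead of being read off the plot. This minimal sketch uses invented labels and probabilities (the arrays `actual` and `prob` and the helper `sens_spec` are assumptions for illustration) and picks the cut-off at which sensitivity and specificity are closest:

```python
import numpy as np

def sens_spec(actual, prob, threshold):
    """Sensitivity and specificity when predicting 1 for prob > threshold."""
    pred = (prob > threshold).astype(int)
    tp = np.sum((pred == 1) & (actual == 1))
    fn = np.sum((pred == 0) & (actual == 1))
    tn = np.sum((pred == 0) & (actual == 0))
    fp = np.sum((pred == 1) & (actual == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Invented example: five negatives followed by five positives.
actual = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
prob = np.array([0.05, 0.15, 0.25, 0.35, 0.62, 0.42, 0.58, 0.72, 0.85, 0.95])

thresholds = np.arange(0.0, 1.0, 0.1)
gaps = [abs(sens - spec)
        for sens, spec in (sens_spec(actual, prob, t) for t in thresholds)]

# The threshold with the smallest gap approximates the intersection point.
best = thresholds[int(np.argmin(gaps))]
print(best)
```

With the notebook's data, `actual` and `prob` would be `cutoff['Actual']` and `cutoff['Pred']`. A finer threshold grid would give a more precise intersection.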

- ### Subtask 4.2: ROC- Receiver Operating Characteristic Curve

In [None]:
def roc (actual, prob):
    FPR, TPR, threshold = metrics.roc_curve(actual, prob, drop_intermediate = False)
    auc_score = metrics.roc_auc_score(actual, prob)
    plt.plot(FPR, TPR, label = 'ROC curve (area = %0.2f)' %auc_score)
    plt.legend(loc = 'lower right')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic Curve')
    plt.show()
    return None

In [None]:
FPR, TPR, threshold = metrics.roc_curve(cutoff['Actual'], cutoff['Pred'], drop_intermediate = False)
roc(cutoff['Actual'], cutoff['Pred'])

##### Explanation
The model has achieved an AUC (area under the ROC curve) of 0.90, which is high, and the graph above shows the curve hugging the top-left corner. This means that the model achieves a high TPR while keeping the FPR low, i.e. it separates patients who are likely to get a heart attack from those who are not.
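
One way to read the AUC: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below checks this on invented scores (the arrays are assumptions for illustration) by comparing every positive-negative pair and then calling `roc_auc_score` on the same data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities.
actual = np.array([0, 0, 0, 1, 1, 1])
prob = np.array([0.2, 0.4, 0.7, 0.3, 0.8, 0.9])

pos = prob[actual == 1]
neg = prob[actual == 0]

# Difference for every (positive, negative) pair; ties count as half a win.
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc)
print(roc_auc_score(actual, prob))
```

Both values agree, so an AUC of 0.90 can be read as the model ranking a heart attack patient above a non heart attack patient in about 90% of such pairs.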

##### Solution: Selecting the cut-off point.
Even though the optimal threshold was identified at 0.6, we cannot go ahead with this cut-off point, as it only reached an accuracy and sensitivity of 82%. When it comes to predicting a heart attack there is little room for error, so accuracy, sensitivity and FNR are of utmost importance. Keeping all these factors in mind, the cut-off is set at 0.5, at which the model produces the following scores (best results).

In [None]:
measures.loc[measures['Probability']== 0.5]

## Task 5: Prediction and Evaluation on Test data
- ### Subtask 5.1: Prediction 

In [None]:
df_test[var] = scaler.transform(df_test[var])
df_test.describe()

In [None]:
# Assigning X and Y
Y_test = df_test.pop('target')
X_test = df_test

In [None]:
# Matching the test data with Log_model_6 columns
cols = X_6.columns
X_test = X_test[cols]

In [None]:
# Prediction on test data
X_test_sm = sm.add_constant(X_test)
Y_test_pred = log_model_6.predict(X_test_sm)

In [None]:
test = pd.DataFrame()
test['Actual'] = Y_test.values
test['Pred'] = Y_test_pred.values
test['Final'] = test['Pred'].map(lambda x: 1 if x >= 0.5 else 0)
test.head()

- ### Subtask 5.2: Evaluation

In [None]:
con = metrics.confusion_matrix(test['Actual'], test['Final'])
con

In [None]:
sensitivity = con[1,1]/(con[1,1]+con[1,0])
specificity = con[0,0]/(con[0,0]+con[0,1])
FNR = con[1,0]/(con[1,0]+con[1,1])

In [None]:
print({'Accuracy': round(metrics.accuracy_score(test['Actual'], test['Final']),2)})
print({'Sensitivity': round(sensitivity, 2)})
print({'Specificity': round(specificity, 2)})
print({'FNR': round(FNR, 2)})

##### Explanation
The model also performs well on the test data, which suggests that it generalizes to unseen data and is stable.

- ### Subtask 5.3: Probability of getting a heart attack

In [None]:
df_test['target'] = Y_test.values
df_test['probability'] = test['Pred'].values
df_test['final'] = test['Final'].values

In [None]:
df_test.head(10)