# Heart Disease Study

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
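
As a minimal sketch of the presence/absence distinction described above, the 0-4 "goal" value can be collapsed into a binary label. The helper name `binarize_goal` and the sample values are illustrative assumptions, not part of the original dataset tooling.

```python
def binarize_goal(goal):
    """Collapse the 0-4 goal field into 0 (absence) or 1 (presence)."""
    if goal not in range(5):
        raise ValueError(f"goal must be an integer in 0..4, got {goal!r}")
    return int(goal > 0)

# Illustrative values only, not rows taken from the dataset
goals = [0, 1, 2, 3, 4]
binary = [binarize_goal(g) for g in goals]
print(binary)
```

The file used in this notebook already ships a binary target column, so this step only illustrates how such a column relates to the original field.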

## Importing libraries and reading dataset

First, we are going to import only the libraries for manipulating and plotting data. Sklearn's modules won't be loaded for now, only in the second part.

In [None]:
#Main libraries to work with the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#set plots to show without the need of plt.show()
%matplotlib inline

#setting seaborn's plots styles
sns.set_style("darkgrid")
sns.set_palette("colorblind")

#avoid showing warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv("../input/heart.csv")
data.head() #5 first rows

## General interpretation of dataset

The dataset contains the following features:
1. age: in years
2. sex: (1 = male; 0 = female)
3. cp: chest pain type
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 1 = normal; 2 = fixed defect; 3 = reversible defect
> The original info for this feature is: "A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)" but the values in the dataset are 0 (unknown, 2 instances), 1, 2 and 3.
14. target: 0 = no (healthy), 1 = yes (sick)

### Stringifying Columns to Facilitate Analysis

In [None]:
data["sex_s"] = data["sex"].map({0: "female", 1: "male"})
data["cp_s"] = data["cp"].map({0: "typical angina", 1: "atypical angina", 2: "non-anginal pain", 3: "asymptomatic"})
data["fbs_s"] = data["fbs"].map({0: "<= 120 mg/dl", 1: "> 120 mg/dl"})
data["restecg_s"] = data["restecg"].map({0: "normal", 1: "abnormal", 2: "dangerous"})
data["exang_s"] = data["exang"].map({0: "no", 1: "yes"})
data["slope_s"] = data["slope"].map({0: "upsloping", 1: "flat", 2:"downsloping"})
data["target_s"] = data["target"].map({0: "healthy", 1: "sick"})

data["is_sick"] = data["target"]
data.drop(["target"],axis=1,inplace=True)

## First glance at the data and understanding the target
After stringifying the categorical columns and renaming the target to "is_sick", let's see what the dataset looks like now and take a first look at the "is_sick" column.

In [None]:
data.head(3)

In [None]:
#Columns infos
data.info()

In [None]:
#General description of the numerical values, such as mean, median, std etc.
data.describe()

In [None]:
#Correlation matrix
#More about correlation:
#https://stats.stackexchange.com/questions/18082/how-would-you-explain-the-difference-between-correlation-and-covariance
plt.figure(figsize=(16,8))
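#Only numeric columns can be correlated, so the stringified "_s" columns are left out
sns.heatmap(data=data.select_dtypes("number").corr(),annot=True,cmap="viridis")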
plt.title("Correlation Matrix")

**Highest correlated factors to target:**
* Chest pain (cp)
* Maximum heart rate achieved (thalach)
* Exercise induced angina (exang)
* ST depression induced by exercise relative to rest (oldpeak)

**Worth checking**
* Colored vessels by fluoroscopy (ca)
* Slope (slope)
* Sex (sex)
* Age (age)

**Lowest correlated factors to target:**
* Fasting blood sugar (fbs)
* Serum cholesterol (chol)
* Resting blood pressure (trestbps)
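
The numbers in the heatmap are Pearson correlation coefficients, which is the default for `DataFrame.corr()`. A small sketch of that computation in plain Python follows; the function name `pearson` and the toy values are assumptions made for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

# Hypothetical values: a numeric feature and a binary target
heart_rate = [150, 160, 170, 120, 130, 140]
is_sick = [1, 1, 1, 0, 0, 0]
r = pearson(heart_rate, is_sick)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship with the target, while a value near 0 (as for fbs, chol and trestbps above) indicates a weak one.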

In [None]:
#Checking target distribution
sns.countplot(x="target_s",data=data)

## Exploratory Data Analysis (EDA)
### Chest Pain (cp)

In [None]:
fig, ax = plt.subplots(3,2,figsize=(16,12))
sns.boxplot(x="cp_s",y="age",data=data,ax=ax[0][0])
sns.boxplot(x="cp_s",y="age",hue="target_s",data=data,ax=ax[0][1])
sns.boxplot(x="cp_s",y="trestbps",data=data,ax=ax[1][0])
sns.boxplot(x="cp_s",y="thalach",data=data,ax=ax[2][0])
sns.boxplot(x="cp_s",y="trestbps",hue="target_s",data=data,ax=ax[1][1])
sns.boxplot(x="cp_s",y="thalach",hue="target_s",data=data,ax=ax[2][1])

In [None]:
fig, ax = plt.subplots(1,2,figsize=(16,3))
sns.countplot(x="cp_s",hue="target_s",data=data,ax=ax[0])
sns.barplot(x="cp_s",y="is_sick",data=data,ax=ax[1])

These graphs show that healthy people usually have the chest pain value "typical angina". For the other cases, even asymptomatic, there are more sick people than healthy ones. The difference between sick and healthy for non-anginal pain and atypical angina is interesting to notice.

We'll work with three strategies and see which performs better: 
1. Cp as the four values we already have;
2. Typical angina = 0, the rest = 1;
3. Typical angina and asymptomatic = 0, non-anginal and atypical angina = 1.

In [None]:
#Engineering two new binary features from the chest pain type
#1st feature: typical angina = 0, the rest = 1
data["cp_typ_x_rest"] = (~data["cp_s"].isin(["typical angina"])).astype(int)
#2nd feature: typical angina and asymptomatic = 0, non-anginal and atypical angina = 1
data["cp_typ_&_asymp_x_rest"] = (~data["cp_s"].isin(["typical angina","asymptomatic"])).astype(int)

In [None]:
data.head(2) #checking if the two new columns were added

### Maximum Heart Rate Achieved (thalach)

In [None]:
fig, ax = plt.subplots(2,2,figsize=(16,8))
sns.distplot(data["thalach"],ax=ax[0][0])
ax[0][0].set_title("Distribution over the dataset")
sns.kdeplot(data[data["target_s"] == "sick"]["thalach"],ax=ax[0][1],color="red",label="sick")
sns.kdeplot(data[data["target_s"] == "healthy"]["thalach"],ax=ax[0][1],color="green",label="healthy")
ax[0][1].set_title("Distribution over the dataset separated by sick and healthy people")
sns.distplot(data[data["sex_s"] == "female"]["thalach"],ax=ax[1][0],color="orange")
ax[1][0].set_title("Distribution for women")
sns.distplot(data[data["sex_s"] ==   "male"]["thalach"],ax=ax[1][1],color="blue")
ax[1][1].set_title("Distribution for men")
plt.tight_layout()

As expected, the **maximum heart rate** achieved is **higher** for **sick people**. There isn't a significant difference between the values for men and women.

### Exercise induced angina (exang)

In [None]:
fig, ax = plt.subplots(1,2,figsize=(16,4))
sns.countplot(x="exang",data=data,ax=ax[0]) #0 - not induced / #1 - induced
ax[0].set_title("Distribution over the dataset")
sns.countplot(x="target_s",hue="exang",data=data,ax=ax[1])
ax[1].set_title("Distribution over the dataset separated by sick and healthy people")

### Oldpeak

In [None]:
sns.kdeplot(data[data["target_s"] ==    "sick"]["oldpeak"],color="red",label="sick")
sns.kdeplot(data[data["target_s"] == "healthy"]["oldpeak"],color="green",label="healthy")

The values for **oldpeak** tend to be **lower** for **sick people**.

## Quick look on features labeled as "worth checking"
### Age and Sex

In [None]:
#Age is a continuous value, so a histogram is appropriate.
sns.distplot(data["age"])
data["age"].describe()

In [None]:
data.loc[data["age"] == 29]

The **youngest** person in this dataset is 29 years old and already **has heart disease**. I'd better call my doctor...

In [None]:
sns.countplot(x="sex_s",hue="target_s",data=data,palette="magma")
print("Males   in dataset: {}".format(data.loc[data["sex_s"] == "male",:].shape[0]))
print("Females in dataset: {}".format(data.loc[data["sex_s"]=="female",:].shape[0]))

In [None]:
fig, ax = plt.subplots(2,2,figsize=(16,8))
#0,0
sns.kdeplot(
    data[(data["sex_s"] == "female") & (data["target_s"] == "sick")]["age"],
    color="yellow",shade=True,ax=ax[0][0])
sns.kdeplot(
    data[(data["sex_s"] == "female") & (data["target_s"] == "healthy")]["age"],
    color="violet",shade=True,ax=ax[0][0])
ax[0][0].legend(labels=("female_sick","female_healthy"))
#0,1
sns.kdeplot(
    data[(data["sex_s"] == "male") & (data["target_s"] == "sick")]["age"],
    color="red",shade=True,ax=ax[0][1])
sns.kdeplot(
    data[(data["sex_s"] == "male") & (data["target_s"] == "healthy")]["age"],
    color="blue",shade=True,ax=ax[0][1])
ax[0][1].legend(labels=("male_sick","male_healthy"))
#1,0
data["age_cats"] = pd.cut(data["age"],bins=[28,40,50,60,100])
sns.countplot(x="age_cats",data=data,ax=ax[1][0])
#1,1
sns.barplot(x="age_cats",y="is_sick",data=data,ax=ax[1][1])
ax[1][1].set_title("Proportion of sick people for each age category")

This is quite a surprise. Even though the majority of people are over 50, these people are not the sickest... At least not in this dataset.
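
To make the age categories concrete: `pd.cut` with its default settings builds right-closed intervals, so the edges used above give (28, 40], (40, 50], (50, 60] and (60, 100], and the bar plot shows the mean of `is_sick` within each interval. The sketch below reproduces that logic in plain Python. The helper names and the sample ages are hypothetical and only serve the illustration.

```python
from bisect import bisect_left

EDGES = [28, 40, 50, 60, 100]  # same edges as in the pd.cut call

def age_bin(age):
    """Return the right-closed interval containing age, or None if outside."""
    i = bisect_left(EDGES, age) - 1
    if i < 0 or i >= len(EDGES) - 1:
        return None
    return f"({EDGES[i]}, {EDGES[i + 1]}]"

def sick_rate_by_bin(ages, is_sick):
    """Share of sick people (mean of the 0/1 flag) in each age interval."""
    totals, sick = {}, {}
    for age, flag in zip(ages, is_sick):
        label = age_bin(age)
        if label is None:
            continue  # ages outside all intervals are skipped
        totals[label] = totals.get(label, 0) + 1
        sick[label] = sick.get(label, 0) + flag
    return {label: sick[label] / totals[label] for label in totals}

# Hypothetical sample, not rows from the dataset
ages = [29, 40, 41, 55, 58, 63, 70]
flags = [1, 1, 0, 1, 0, 0, 1]
rates = sick_rate_by_bin(ages, flags)
print(rates)
```

Note that an age exactly on the lowest edge (28) would fall outside every interval, which is why the first edge is set just below the youngest patient.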

## Modeling
### Feature Importance
First step is to use an ensemble algorithm to check feature importance.

Reference: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

print("List of all features: {}".format(data.columns))

In [None]:
def feature_importances(dataset,features_list,target,random_state=14):
    """
    Wrap-up function to train an ExtraTreesClassifier and return a descending ordered feature importance list
    """
    #random_state keeps the importances reproducible between runs
    etc = ExtraTreesClassifier(n_estimators=50,random_state=random_state)
    etc.fit(dataset[features_list],dataset[target])
    #label each importance with the name of the feature it belongs to
    fi = pd.DataFrame(data=etc.feature_importances_,index=features_list,columns=["Feature Importance"])
    return fi.sort_values(by="Feature Importance",ascending=False)

In [None]:
#Using the original 13 features
X = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
y = 'is_sick'

feature_importances(data,X,y)

In [None]:
#Using cp as the 1st engineered feature
X = ['age','sex','cp_typ_x_rest','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
y = 'is_sick'

feature_importances(data,X,y)

In [None]:
X = ['age','sex','cp_typ_&_asymp_x_rest','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope',
     'ca','thal']
y = 'is_sick'

feature_importances(data,X,y)

In [None]:
X = ['age','sex','cp','cp_typ_x_rest','cp_typ_&_asymp_x_rest','trestbps','chol','fbs','restecg','thalach','exang',
     'oldpeak','slope','ca','thal']
y = 'is_sick'

feature_importances(data,X,y)

Extra trees confirms chest pain as one of the most important features. Sex perhaps has low importance here due to the unbalanced proportion of men and women (70% are men approx.). We already had a clue that fbs, chol and trestbps had low impact. Age having only a mid weight surprises me though.

## Testing different algorithms

In [None]:
#Base algorithms, no ensembling for now
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

#Cross validation
from sklearn.model_selection import cross_validate, GridSearchCV, train_test_split

#Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler

#Metrics to evaluate models
#General metrics for classifiers
from sklearn.metrics import classification_report, confusion_matrix 

#Metrics for precision/recall trade-off (more of this later in this notebook)
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, precision_score, recall_score

import time #built-in library to measure time

In [None]:
def autotrain(X,y,scoring="accuracy",cv_split=5,title=""):
    """
    Performs cross validation of defined base models and presents results as a dataframe sorted by best test scores.
    Adapted from LD Freeman's kernel:
    https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
    """
    #define base training models.
    models = [KNeighborsClassifier(), SVC(gamma="auto"), LogisticRegression(solver="liblinear"), 
              DecisionTreeClassifier(), GaussianNB()]
    
    #create a dataframe to store training information and display after all iterations are finished.
    results = pd.DataFrame(columns=["Algorithm","Base Estimator",
                                    "Train Time","Train Score", "Test Score","Scaling Method"])
    
    print(title) #title for the resulting dataframe
    for i,model in enumerate(models):
        #define scalers to try
        scalers = [StandardScaler(),MinMaxScaler(),MaxAbsScaler(),RobustScaler()]
        results.loc[i,"Algorithm"] = model.__class__.__name__
        training = cross_validate(model,X,y,cv=cv_split,scoring=scoring,return_train_score=True)
        results.loc[i,"Base Estimator"] = str(model)
        results.loc[i,"Train Time"] = training["fit_time"].sum()
        results.loc[i,"Train Score"] = training["train_score"].mean()
        results.loc[i,"Test Score"] = training["test_score"].mean()
        results.loc[i,"Scaling Method"] = "Unscaled"
        #print("Model: {}".format(model.__class__.__name__))
        #print("Testing Score (unscaled): {}".format(training["test_score"].mean()))    
        for scaler in scalers:
            X_scaled = scaler.fit_transform(X)
            training = cross_validate(model,X_scaled,y,cv=cv_split,scoring=scoring,return_train_score=True)
            #print("Testing Score ({}): {}".format(scaler.__class__.__name__,training["test_score"].mean()))
            if training["test_score"].mean() > results.loc[i,"Test Score"]:
                results.loc[i,"Train Score"] = training["train_score"].mean()
                results.loc[i,"Test Score"] = training["test_score"].mean()
                results.loc[i,"Scaling Method"] = scaler.__class__.__name__
        #print("*"*50)
        
    return results.sort_values(by="Test Score",ascending=False)

In [None]:
X = data[['age','sex','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','cp']]
y = data['is_sick']
autotrain(X,y,title="Model features with all 13 original features")

In [None]:
X = data[['age','sex','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal',
          'cp_typ_x_rest']]

autotrain(X,y,title="Model features including engineered feature for chest pain typical angina X rest")

In [None]:
X = data[['thalach','exang','slope','ca','thal','cp']]

autotrain(X,y,title="Model features with the best features (according to ExtraTreesClassifier) already existing")

In [None]:
X = data[['thalach','slope', 'ca','thal','cp_typ_x_rest']]

autotrain(X,y,title="Model features with the best features (according to ExtraTreesClassifier) and engineered cp")

**Highest Test Scores (Accuracy)**
1. SVC with (all) existing best RobustScaled features -> .8446
2. SVC with engineered best RobustScaled features -> .8445
3. Logistic Regression with (all) existing StandardScaled features -> .8349
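
The scaler names in this list refer to how each feature column is rescaled before training. As a rough sketch, standard scaling subtracts the mean and divides by the (population) standard deviation, while robust scaling subtracts the median and divides by the interquartile range, which makes it less sensitive to outliers. The functions below are plain-Python approximations written for illustration, assuming the default settings of the sklearn scalers; the sample values are hypothetical.

```python
import statistics

def standard_scale(values):
    """(x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """(x - median) / interquartile range (25th to 75th percentile)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical column with one large outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]
standard = standard_scale(values)
robust = robust_scale(values)
print([round(v, 2) for v in standard])
print(robust)
```

With standard scaling the outlier squeezes the four ordinary values into a narrow band, whereas robust scaling keeps them evenly spread around zero.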

For the results listed above, it is important to notice that when we compare:
* result 1 with the SVC w/ RobustScaler trained on all existing features, the train score has **dropped**;
* result 2 with the SVC w/ RobustScaler trained on the engineered cp and all the remaining features, the train score has **dropped**.

So, we clearly see that adding (not so useful) features led to **overfitting**, i.e., the models performed excellently on the train set but failed to predict unseen data. The **Accuracy x Error** for training and testing can be seen below. We can see that an increase in the number of features (an increase in model **complexity**) is good up to a certain point. After that, the model starts to (over)fit the known data too closely, which means that it is capturing **unwanted noise** instead of **generalizing**. The exclamation mark indicates the **optimal point**.

<img src="https://upload.wikimedia.org/wikipedia/commons/f/fc/Overfitting.png" width="30%">
_Source: Wikipedia commons_

## Train Test Splitting to check Recall

In [None]:
X_svm1 = RobustScaler().fit_transform(data[['thalach','exang','slope','ca','thal','cp']])
X_svm2 = RobustScaler().fit_transform(data[['thalach','slope', 'ca','thal','cp_typ_x_rest']])
X_lreg = StandardScaler().fit_transform(data[['age','sex','trestbps','chol','fbs','restecg','thalach','exang',
                                                'oldpeak','slope','ca','thal','cp']])

y = data['is_sick']

svm1_X_train, svm1_X_test, y_train, y_test = train_test_split(X_svm1, y, test_size=.3, random_state=14)
svm2_X_train, svm2_X_test, y_train, y_test = train_test_split(X_svm2, y, test_size=.3, random_state=14)
lreg_X_train, lreg_X_test, y_train, y_test = train_test_split(X_lreg, y, test_size=.3, random_state=14)

svm1 = SVC(gamma="auto")
svm2 = SVC(gamma="auto")
lreg = LogisticRegression(solver="liblinear")

svm1.fit(svm1_X_train,y_train)
svm2.fit(svm2_X_train,y_train)
lreg.fit(lreg_X_train,y_train)

print("Support-vector classifiers and logistic regression trained with 70% of the dataset randomly chosen")
print("Number of instances for training: {}".format(lreg_X_train.shape[0]))
print("Number of instances for testing: {}".format(lreg_X_test.shape[0]))

In [None]:
fig, ax = plt.subplots(1,3,figsize=(16,4))

svm1_predict = svm1.predict(svm1_X_test)
svm1_conf_matrix = confusion_matrix(y_test,svm1_predict)

svm2_predict = svm2.predict(svm2_X_test)
svm2_conf_matrix = confusion_matrix(y_test,svm2_predict)

lreg_predict = lreg.predict(lreg_X_test)
lreg_conf_matrix = confusion_matrix(y_test,lreg_predict)

labels = ("healthy","sick")

sns.heatmap(svm1_conf_matrix,   annot=True,cmap="coolwarm_r",xticklabels=labels,yticklabels=labels,ax=ax[0])
ax[0].set_title("Support-Vector Classifier #1")
ax[0].set_ylabel("Actual Values", fontsize=16)
ax[0].set_xlabel("Predicted Values", fontsize=16)
#(with RobustScaler & only best existing features)

sns.heatmap(svm2_conf_matrix,   annot=True,cmap="coolwarm_r",xticklabels=labels,yticklabels=labels,ax=ax[1])
ax[1].set_title("Support-Vector Classifier #2")
ax[1].set_xlabel("Predicted Values", fontsize=16)
#(with RobustScaler & engineered cp and best features)

sns.heatmap(lreg_conf_matrix, annot=True,cmap="coolwarm_r",xticklabels=labels,yticklabels=labels,ax=ax[2])
ax[2].set_title("Logistic Regressor Classifier")
ax[2].set_xlabel("Predicted Values", fontsize=16)
#(with StandardScaler & all 13 existing features)

SVC #1 predicted 75/91 = 82.4% of cases correctly. SVC #2 predicted 72/91 = 79.1% of cases correctly. LogReg predicted 76/91 = 83.5% of cases correctly. Although the differences seem small, we're going to perform **hyperparameter tuning** only on SVC #1 and the Logistic Regressor. This is due to the fact that SVC #2 is predicting almost twice as many **False Negatives (FN)** as the other two models. In plain English, this means that SVC #2 is telling twice as many sick patients that they are healthy. We want to be **conservative** and minimize this kind of error. It is better to predict that a healthy person is sick and perform more tests on them than to send sick people home.

According to sklearn's documentation:
The recall is the ratio $TP  / (TP + FN)$ where $TP$ is the number of **true positives** and $FN$ the number of **false negatives**. The recall is intuitively the **ability** of the classifier **to find all the positive samples**. <br>(Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)
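
The definition above can be sketched in a few lines of plain Python. The function name `recall` and the toy labels are assumptions for illustration; in the notebook itself, `recall_score` from sklearn does this job.

```python
def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN) for the class given by `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no positive samples at all, so recall is undefined; 0.0 is a convention here
    return tp / (tp + fn)

# Toy labels: four sick patients (1), two healthy ones (0)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
sick_recall = recall(y_true, y_pred)
print(sick_recall)
```

Here one of the four sick patients is missed, so the recall for the sick class is 3/4. The false positive on a healthy patient does not affect it.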


## Interpreting the Classification Report

In [None]:
print("Classification report for Support-Vector Classifier #1".upper())
print("-"*60)
print(classification_report(y_test,svm1_predict,target_names=labels))

From the confusion matrix, we know that: 
* $TP = ActuallySick = 44$
* $FP = PredictedSickButIsHealthy = 11$
* $TN = ActuallyHealthy = 31$
* $FN = PredictedHealthyButIsSick = 5$

### Precision: $\frac{TP}{TP+FP}$

$Precision(sick) = \frac{ActuallySick}{ActuallySick + PredictedSickButIsHealthy} = \frac{44}{44+11} = 80.0\%$
<br><br>
$Precision(healthy) = \frac{ActuallyHealthy}{ActuallyHealthy + PredictedHealthyButIsSick} = \frac{31}{31+5} = 86.1\%$
### Recall (or sensitivity): $\frac{TP}{TP+FN}$
$Recall(sick) = \frac{ActuallySick}{ActuallySick + PredictedHealthyButIsSick} = \frac{44}{44+5} = 89.8\%$
<br><br>
$Recall(healthy) = \frac{ActuallyHealthy}{ActuallyHealthy + PredictedSickButIsHealthy} = \frac{31}{31+11} = 73.8\%$
___
Our goal is to minimize the False Negatives (FN) for **safety** reasons. The _Predicted Healthy but is Sick_ in this test set is 5. Ideally pushing it to zero would lead to $Precision(healthy) = 1$ and $Recall(sick) = 1$.
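
The arithmetic above can be checked with a short script that starts from the four counts of the confusion matrix. This is only a sketch: the variable names and the `ratio` helper are chosen for illustration, and the counts are the ones quoted in this section.

```python
# Counts quoted above for Support-Vector Classifier #1 ("sick" is the positive class)
tp, fp, tn, fn = 44, 11, 31, 5

def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

report = {
    "sick": {"precision": ratio(tp, tp + fp), "recall": ratio(tp, tp + fn)},
    # For the "healthy" class the roles of the counts are mirrored
    "healthy": {"precision": ratio(tn, tn + fn), "recall": ratio(tn, tn + fp)},
}
accuracy = ratio(tp + tn, tp + fp + tn + fn)

for label, metrics in report.items():
    print(f"{label}: precision={metrics['precision']:.3f} recall={metrics['recall']:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Setting `fn = 0` in this script shows the effect described above: both the precision for "healthy" and the recall for "sick" become 1.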

## Improving our Model: Tuning Hyperparameters
For this step, we're going to use GridSearchCV. We can create a dictionary with all the hyperparameters we want to try, and the grid search will try every possible combination of them. In the end, it returns the best performing model.
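
To get a feel for what "every possible combination" means in practice, the sketch below enumerates the SVM grid used in this notebook with `itertools.product` and counts the resulting model fits. The powers of ten are written out by hand to mirror the `np.logspace` calls; the variable names are illustrative.

```python
from itertools import product

# Same values as the SVM grid in this notebook, written without numpy
param_grid = {
    "kernel": ["rbf", "linear"],
    "C": [10.0 ** k for k in range(-5, 4)],      # 1e-5 ... 1e3, 9 values
    "gamma": [10.0 ** k for k in range(-4, 0)],  # 1e-4 ... 1e-1, 4 values
    "decision_function_shape": ["ovo", "ovr"],
    "random_state": [41],
}

names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 4
print("combinations:", len(combinations))
print("fits during cross-validation:", len(combinations) * n_folds)
```

The count grows multiplicatively with every hyperparameter added, which is why the tuning cells below measure the elapsed time.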

In [None]:
#SVM Tuning

params_svm = {
    "kernel": ["rbf", "linear"],
    "C": np.logspace(-5,3,9),
    "gamma": np.logspace(-4,-1,4),
    "decision_function_shape": ["ovo", "ovr"],
    "random_state": [41],
}

# In each fold, the dataset will be split 75/25
grid1 = GridSearchCV(SVC(probability=True),param_grid=params_svm,cv=4)

start = time.perf_counter()
grid1.fit(X_svm1,y)
end = time.perf_counter()

print("SVM tuning amount of seconds elapsed: {:.2f}".format(end-start))
print("Best parameters found for this model: {}".format(grid1.best_params_))

In [None]:
svm_tuned = grid1.best_estimator_

svm_tuned_predict = svm_tuned.predict(X_svm1) #checking overall performance for the tuned model
svm_tuned_conf_matrix = confusion_matrix(y,svm_tuned_predict)

sns.heatmap(svm_tuned_conf_matrix,annot=True,cmap="coolwarm_r",xticklabels=labels,yticklabels=labels,fmt="1")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("SVM after hyperparameters tuning")

In [None]:
print("Classification report for Support-Vector Classifier (Tuned) #1".upper())
print("-"*60)
print(classification_report(y,svm_tuned_predict,target_names=labels))

In [None]:
#LogReg Tuning

params_lreg = {
    "C": np.logspace(-5,3,9),
    "solver": ("liblinear","lbfgs","sag","saga","newton-cg"),
    "fit_intercept": (True,False),
    "random_state": [41],
}

# In each fold, the dataset will be split 75/25
grid2 = GridSearchCV(LogisticRegression(),param_grid=params_lreg,cv=4)

start = time.perf_counter()
grid2.fit(X_lreg,y)
end = time.perf_counter()

print("LogReg tuning amount of seconds elapsed: {:.2f}".format(end-start))
print("Best parameters found for this model: {}".format(grid2.best_params_))

In [None]:
print("Classification report for Logistic Regression".upper())
print("-"*60)
print(classification_report(y_test,lreg_predict,target_names=labels))
print("\n")
lreg_tuned = grid2.best_estimator_
lreg_tuned_predict = lreg_tuned.predict(X_lreg)
print("Classification report for Logistic Regression (Tuned) #1".upper())
print("-"*60)
print(classification_report(y,lreg_tuned_predict,target_names=labels))

## A new approach for improvement: Adjusting Threshold
According to Kevin Arvai: 
>The **precision_recall_curve** and **roc_curve** are useful tools to visualize the **sensitivity-specificity tradeoff** in the classifier. They help inform a data scientist where to set the **decision threshold** of the model to maximize either sensitivity or specificity. This is called the “operating point” of the model.
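
As a minimal sketch of what a decision threshold does: the classifier outputs a probability of being sick, and every probability at or above the threshold is turned into a "sick" prediction. Lowering the threshold catches more sick patients (higher recall), possibly at the cost of more false alarms. The helper names and the toy scores below are hypothetical.

```python
def apply_threshold(scores, threshold):
    """Turn probabilities into 0/1 predictions."""
    return [1 if score >= threshold else 0 for score in scores]

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: four sick (1) and four healthy (0) patients with made-up scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.45, 0.6, 0.4, 0.3, 0.1]

default_result = precision_recall(y_true, apply_threshold(scores, 0.5))
lowered_result = precision_recall(y_true, apply_threshold(scores, 0.42))
print("threshold 0.50:", default_result)
print("threshold 0.42:", lowered_result)
```

In this toy case the sick patient scored at 0.45 is missed with the default threshold and caught with the lowered one.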

In [None]:
#SVC
y_preds = svm_tuned.predict(svm1_X_test)
y_scores_svm = svm_tuned.predict_proba(svm1_X_test)[:,1]

pred_and_proba = pd.DataFrame(data={"Final Prediction": y_preds, "Proba": y_scores_svm})

In [None]:
#Checking if threshold is .5
pred_and_proba.loc[(pred_and_proba["Proba"] > .4) & (pred_and_proba["Proba"] < .6),:]

In [None]:
svm_prec, svm_rec, svm_t = precision_recall_curve(y_test,y_scores_svm)
y_scores_lreg = lreg_tuned.predict_proba(lreg_X_test)[:,1]
lreg_prec, lreg_rec, lreg_t = precision_recall_curve(y_test,y_scores_lreg)

fig, ax = plt.subplots(1,2,figsize=(16,3))
plt.sca(ax[0])
plt.step(svm_rec,svm_prec,where="post",alpha=.5,color="r")
plt.fill_between(svm_rec,svm_prec,step="post",alpha=.2,color="r")
plt.xlim(.79,1.001)
plt.xlabel("Recall",fontsize=14)
plt.ylabel("Precision",fontsize=14)
plt.title("Recall vs. Precision",fontsize=18)
plt.sca(ax[1])
plt.step(lreg_rec,lreg_prec,where="post",color="b")
plt.fill_between(lreg_rec,lreg_prec,step="post",alpha=.2,color="b")
plt.xlim(.79,1.001)
plt.xlabel("Recall",fontsize=14)
plt.ylabel("Precision",fontsize=14)
plt.title("Recall vs. Precision",fontsize=18)

We can see that at a recall of almost 1 (.98 or .97) we can achieve a precision of .78 for the SVC and .75 for LogReg. If a recall of .94 is acceptable, LogReg has a better precision, .81, against .79 for the SVC. Let's try to find both thresholds just to compare.

In [None]:
plt.plot(np.arange(0,svm_t.shape[0]),svm_t,color="r")
plt.plot(np.arange(0,lreg_t.shape[0]),lreg_t,color="b")

The threshold for Logistic Regression is a lot more sensitive, i.e., a small change in its value can considerably change the outcome. This looks unstable. Let's proceed with SVC and find a suitable threshold.

In [None]:
#SVM
for threshold in np.arange(0,1.05,.05):
    y_adj = [1 if score >= threshold else 0 for score in y_scores_svm]
    print("SVM: For threshold of {:.2f}, precision is {:.3f} and recall is {:.3f}".format(
    threshold, precision_score(y_test,y_adj), recall_score(y_test,y_adj)))

In [None]:
plt.figure(figsize=(8, 8))
plt.title("Precision and Recall Scores as a function of the decision threshold")
plt.plot(svm_t, svm_prec[:-1], "b--", label="Precision")
plt.plot(svm_t, svm_rec[:-1], "g-", label="Recall")
plt.ylabel("Score",fontsize=14)
plt.xlabel("Decision Threshold",fontsize=14)
plt.legend(loc='best')

We can see that the recall curve decreases more slowly than the precision increases. The optimal point is between .6 and .7, but the recall there is around .85. Not good enough. I believe we can settle on a decision threshold somewhere around .41, .42 or .43. Let's test.
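
Instead of reading the threshold off the plot, one possible approach is to search for the highest threshold that still reaches a required recall. The sketch below does this in plain Python, assuming the candidate thresholds are the distinct scores themselves; `pick_threshold` and the toy data are hypothetical and not part of the notebook.

```python
def pick_threshold(y_true, scores, min_recall):
    """Highest threshold whose recall is at least min_recall.

    Returns (threshold, precision, recall), or None if no threshold qualifies.
    """
    n_positive = sum(y_true)
    if n_positive == 0:
        return None
    # Recall can only grow as the threshold is lowered, so the first
    # candidate (from high to low) that meets the target is the highest one.
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if score >= threshold else 0 for score in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        recall = tp / n_positive
        if recall >= min_recall:
            return threshold, tp / (tp + fp), recall
    return None

# Toy data: four sick (1) and four healthy (0) patients with made-up scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.45, 0.6, 0.4, 0.3, 0.1]

strict = pick_threshold(y_true, scores, min_recall=1.0)
relaxed = pick_threshold(y_true, scores, min_recall=0.75)
print(strict)
print(relaxed)
```

Choosing the highest qualifying threshold keeps precision as high as the recall requirement allows, which matches the conservative goal stated earlier.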

In [None]:
for threshold in np.arange(.4,.46,.01):
    y_adj = [1 if score >= threshold else 0 for score in y_scores_svm]
    print("SVM: For threshold of {:.2f}, precision is {:.3f} and recall is {:.3f}".format(
    threshold, precision_score(y_test,y_adj), recall_score(y_test,y_adj)))

In [None]:
#Confusion matrix for threshold = .42
threshold = .42
y_adj = [1 if score >= threshold else 0 for score in y_scores_svm]
sns.heatmap(confusion_matrix(y_test,y_adj),annot=True,fmt="1")

In [None]:
#Classification Report for threshold = .42
print(classification_report(y_test,y_adj))

## Next steps / suggestions:
1. Use sklearn.ensemble classifiers;
2. Use sklearn.feature_selection tools to improve selecting features;
3. Build, with sklearn.pipeline, a black-box model to fit and predict data with pre-fixed threshold.