# Predicting heart disease by using machine learning
 
this notebook looks into using various python based machine learning and data science  libraries in an attempt to build a machine learning model capable of predicting whether or not someone have heart disease based on their medical attributes


**we are going to take the following approach:**
        
        1.Problem definition
        2.Data
        3.Evaluation
        4.Features
        5.Modeling 
        6.Experimentation

## 1.Problem definition 

in statement,
> Given clinical parameters about a patient can we pridict whether or not they have heart disease

## 2.Data

    the original data come from Cleaveland data from UCI machine learning Repository:https://archive.ics.uci.edu/datasets?search=Heart%20Disease
    
    there is also a version of it available on kaggle: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data
   
## 3. Evaluation
> if we can reach 95% of accuracy at predicting whether or not a patient has heart disease during the proof of concept, we will uprue the project.

## 4.Features
This is where you will get different information about each of the features in your data.

**creating data dictionary**

1.age in years
2.sex(1=male,0=female)
3.cp chest pain type
   1: typical angina
   2: atypical angina
   3: non-anginal pain
   4: asymptomatic
4.test bp resting blood pressure (in mm Hg on admission on the hospital)
5.cholserum cholestoral in mg/dl
6.fbs (fasting blood sugar > 20 mg/dl)(1=true; 0=false)
7.restecgresting electrocardiographic results
8.thalchmaximum heart rate acheived 
9.oldpeakST depression induced by exercise relative to rest
10.slopethe slope of the peak exercise ST segment 
11.canumber of major vessels (0-3) colored by flourosopy 
12.thal3=normal; 6=fixed defect; 7=reversable defect 
13.target or 0

## Preparing the tools 
we are going to use pandas,matplotlib and numpy for data analysis and manipulation.

In [2]:
#Importing all the tools we need

#Regular EDA (exporatory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline

#Models from Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#Model Evaluation
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score,recall_score
from sklearn.metrics import roc_curve, auc


# Load the Data

In [3]:
df=pd.read_csv("heart-disease (2).csv")
df.shape

(303, 14)

In [4]:
df.target

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [5]:
df["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [6]:
df.target.value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [7]:
df["target"].value_counts().plot(kind="bar",color=["salmon","lightblue"]);

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [9]:
df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [10]:
df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


# Finding the patterns

In [11]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [12]:
df.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [13]:
df.sex.value_counts()

sex
1    207
0     96
Name: count, dtype: int64

In [14]:
#compare target column with sex column 
pd.crosstab(df.target,df.sex)

sex,0,1
target,Unnamed: 1_level_1,Unnamed: 2_level_1
0,24,114
1,72,93


In [15]:
#creating a plot of crosstab
pd.crosstab(df.target,df.sex).plot(kind="bar",
                                  color=["salmon","lightblue"],
                                  figsize=(10,6))
plt.title("Heart Disease Frequency for sex")
plt.xlabel("0=No disease,1=Disease")
plt.ylabel("Amount")
plt.legend(["Female","male"]);
plt.xticks(rotation=0);

In [16]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [17]:
df["thalach"].value_counts()

thalach
162    11
160     9
163     9
152     8
173     8
       ..
202     1
184     1
121     1
192     1
90      1
Name: count, Length: 91, dtype: int64

### Age vs max heart rate for heart disease

In [18]:
#create another figure
plt.figure(figsize=(10,6))

#scatter with positive examples
plt.scatter(df.age[df.target==1],
           df.thalach[df.target==1],
           c="salmon")

#scatter with negative examples
plt.scatter(df.age[df.target==0],
           df.thalach[df.target==0],
           c="lightblue")

#add some helpful information
plt.title("Heart disease in function of age and max heart rate");
plt.xlabel("Age");
plt.ylabel("Max heart rate");
plt.legend(["Disease","No disease"]);

In [19]:
#check the distribution of the age column with histogram
df.age.plot.hist();


### Heart disease frequency per chest pain type


In [20]:
pd.crosstab(df.cp, df.target)

target,0,1
cp,Unnamed: 1_level_1,Unnamed: 2_level_1
0,104,39
1,9,41
2,18,69
3,7,16


In [21]:
#Make the crosstab more visual
pd.crosstab(df.cp,df.target).plot(kind="bar",
                                 figsize=(10,6),
                                 color=["salmon","lightblue"])
# add some communication 
plt.title("Heart disease Frequency per chest pain type")
plt.xlabel("chest pain type")
plt.ylabel("Amount")
plt.legend(["No Disease","Disease"])
plt.xticks(rotation=0);

In [22]:
df.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [23]:
#make a correlation matrix
df.corr()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
age,1.0,-0.098447,-0.068653,0.279351,0.213678,0.121308,-0.116211,-0.398522,0.096801,0.210013,-0.168814,0.276326,0.068001,-0.225439
sex,-0.098447,1.0,-0.049353,-0.056769,-0.197912,0.045032,-0.058196,-0.04402,0.141664,0.096093,-0.030711,0.118261,0.210041,-0.280937
cp,-0.068653,-0.049353,1.0,0.047608,-0.076904,0.094444,0.044421,0.295762,-0.39428,-0.14923,0.119717,-0.181053,-0.161736,0.433798
trestbps,0.279351,-0.056769,0.047608,1.0,0.123174,0.177531,-0.114103,-0.046698,0.067616,0.193216,-0.121475,0.101389,0.06221,-0.144931
chol,0.213678,-0.197912,-0.076904,0.123174,1.0,0.013294,-0.15104,-0.00994,0.067023,0.053952,-0.004038,0.070511,0.098803,-0.085239
fbs,0.121308,0.045032,0.094444,0.177531,0.013294,1.0,-0.084189,-0.008567,0.025665,0.005747,-0.059894,0.137979,-0.032019,-0.028046
restecg,-0.116211,-0.058196,0.044421,-0.114103,-0.15104,-0.084189,1.0,0.044123,-0.070733,-0.05877,0.093045,-0.072042,-0.011981,0.13723
thalach,-0.398522,-0.04402,0.295762,-0.046698,-0.00994,-0.008567,0.044123,1.0,-0.378812,-0.344187,0.386784,-0.213177,-0.096439,0.421741
exang,0.096801,0.141664,-0.39428,0.067616,0.067023,0.025665,-0.070733,-0.378812,1.0,0.288223,-0.257748,0.115739,0.206754,-0.436757
oldpeak,0.210013,0.096093,-0.14923,0.193216,0.053952,0.005747,-0.05877,-0.344187,0.288223,1.0,-0.577537,0.222682,0.210244,-0.430696


In [24]:
#lets make our correlation matrix a little prettier
corr_matrix=df.corr()
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
              annot=True,
              linewidth=0.5,
              fmt=".2f",
              cmap="YlGnBu")

# 5. Modeling 

In [25]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [26]:
#split data into x and y
x=df.drop("target",axis=1)
y=df["target"]

In [27]:
x

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [28]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [29]:
#split the data into train and test sets
np.random.seed(42)

#split into traun and test set
x_train,x_test,y_train,y_test=train_test_split(x,
                                              y,
                                              test_size=0.2)

In [30]:
x_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
132,42,1,1,120,295,0,1,162,0,0.0,2,0,2
202,58,1,0,150,270,0,0,111,1,0.8,2,0,3
196,46,1,2,150,231,0,1,147,0,3.6,1,0,2
75,55,0,1,135,250,0,0,161,0,1.4,1,0,2
176,60,1,0,117,230,1,1,160,1,1.4,2,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,50,1,2,140,233,0,1,163,0,0.6,1,1,3
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3
106,69,1,3,160,234,1,0,131,0,0.1,1,1,2
270,46,1,0,120,249,0,0,144,0,0.8,2,0,3


In [31]:
y_train, len(y_train)

(132    1
 202    0
 196    0
 75     1
 176    0
       ..
 188    0
 71     1
 106    1
 270    0
 102    1
 Name: target, Length: 242, dtype: int64,
 242)

Now we have got our data split into training and test sets, its time to build a machine learning model


we will train it(find the patterns) on the training sets

And we will test it (use the patterns) on the test sets

we are going to try 3 different machine learning models:

1.Logistic Classifier
2.K-Nearest Neighbours Classifier
3.Random Forest Classifier


In [32]:
# put model in a dictionary
models={"Logistic Regression": LogisticRegression(),
       "KNN":KNeighborsClassifier(),
       "Random Forest": RandomForestClassifier()}

#Create a function to fit score models
def fit_and_score(models,x_train,x_test,y_train,y_test):
    """
    Fits and evaluate the given machine learning models.
    models: a dictionary of different scikit learn models.
    x_train:training data (no labels)
    x_test:testing data(no labels)
    y_train:training labels
    y_test:testing labels
    """
    #set random seed
    np.random.seed(42)
    
    #make a dictionary to keep scores and model_scores
    model_scores={}
    for name, model in  models.items():
        #fit the model to the data
        model.fit(x_train,y_train)
        
        #evaluate the model and append its score to the model_Scores
        model_scores[name]=model.score(x_test,y_test)
    return model_scores

In [33]:
model_scores=fit_and_score(models=models,
                          x_train=x_train,
                          x_test=x_test,
                          y_train=y_train,
                          y_test=y_test)
model_scores

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'Logistic Regression': 0.8852459016393442,
 'KNN': 0.6885245901639344,
 'Random Forest': 0.8360655737704918}

### Model comparision

In [34]:
model_compare=pd.DataFrame(model_scores,
                           index=["Accuracy"])
model_compare.T.plot.bar();

Now we have got baseline model and we knwo a model first predictions arent always what we should based our need steps off. what should we do?

lets look at the following
    
    1.Hyperparameter tuning 
    
    2.featuring importance
    
    3.cross validation.
    
    4.Recall
    
    5.F1 score
    
    6.Classification report
    
    7.ROC curve
    
    8.Area under the Curve(AUC)
    

### Hyper Tuning

In [35]:
#lets tune KNN

train_score=[]
test_score=[]

#create a list of different values for n-neighbors 
neighbors=range(1,21)

#set up KNN instance
knn=KNeighborsClassifier()

#loop throght different n-neighbors =i
for i in neighbors:
    knn.set_params(n_neighbors=i)
    
    #fit the algorithm
    knn.fit(x_train,y_train)
    
    #update the training scores list
    train_score.append(knn.score(x_train,y_train))
    
    # update the testing score list 
    test_score.append(knn.score(x_test,y_test))
    
    


In [36]:
test_score

[0.6229508196721312,
 0.639344262295082,
 0.6557377049180327,
 0.6721311475409836,
 0.6885245901639344,
 0.7213114754098361,
 0.7049180327868853,
 0.6885245901639344,
 0.6885245901639344,
 0.7049180327868853,
 0.7540983606557377,
 0.7377049180327869,
 0.7377049180327869,
 0.7377049180327869,
 0.6885245901639344,
 0.7213114754098361,
 0.6885245901639344,
 0.6885245901639344,
 0.7049180327868853,
 0.6557377049180327]

In [37]:
train_score

[1.0,
 0.8099173553719008,
 0.7727272727272727,
 0.743801652892562,
 0.7603305785123967,
 0.7520661157024794,
 0.743801652892562,
 0.7231404958677686,
 0.71900826446281,
 0.6942148760330579,
 0.7272727272727273,
 0.6983471074380165,
 0.6900826446280992,
 0.6942148760330579,
 0.6859504132231405,
 0.6735537190082644,
 0.6859504132231405,
 0.6652892561983471,
 0.6818181818181818,
 0.6694214876033058]

In [38]:
plt.plot(neighbors,train_score,label="Train score")
plt.plot(neighbors, test_score, label="Test score")
plt.xticks(np.arange(1,21,1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend();

In [39]:
print(f"Maximum knn score on the test data:{max(test_score)*100:.2f}%")

Maximum knn score on the test data:75.41%


hyperparameter with RandomizedSearchCV


we are going to tune 
    1.LogisticRegression()
    
    2.RandommizedSearchCV()
    
    


In [40]:
#create a hyperparameter grid for LogisticRegression
log_reg_grid={"C":np.logspace(-4,4,20),
             "solver":["liblinear"]}

#create a hyperparameter grid for RandomForestClassifier
rf_grid={"n_estimators":np.arange(10,100,50),
        "max_depth":[None,3,5,10],
        "min_samples_split":np.arange(2,20,2),
        "min_samples_leaf":np.arange(1,20,2)}


Now we have got hyperparameter grids set up for each of our models,
lets tune them using RandomizedSearchCV

In [41]:
#Tune LogisticRegression
np.random.seed(42)

#setup random hyperparameter search for LogisticRegression
rs_log_reg=RandomizedSearchCV(LogisticRegression(),
                            param_distributions=log_reg_grid,
                            cv=5,
                            n_iter=20,
                            verbose=True)

#fit Random hyperparameters search model for LogisticRegression 
rs_log_reg.fit(x_train,y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [42]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 0.23357214690901212}

In [43]:
rs_log_reg.score(x_test,y_test)

0.8852459016393442

Now we have tune LogisticRegression(), lets do the same for RandomForestClassfier()

In [44]:
#setup random seed
np.random.seed(42)

#set up random hyperparameter search fot RandomForestClassifier
rs_rf=RandomizedSearchCV(RandomForestClassifier(),
                        param_distributions=rf_grid,
                        cv=5,
                        n_iter=20,
                        verbose=True)
#fit  random hyperparameter search model for RandomForestClassifier()
rs_rf.fit(x_train,y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [45]:
#find the best hyperparameters
rs_rf.best_params_

{'n_estimators': 10,
 'min_samples_split': 18,
 'min_samples_leaf': 17,
 'max_depth': 3}

In [46]:
#Evaluate the randomized earcg RandomForestClassifier model
rs_rf.score(x_test,y_test)

0.819672131147541

## Hyperparameter tuning with GridSearchCV
since our LogisticRegression model provides the best scores so far we will try and improve them again using GridSearchCV

In [47]:
#Different hyperparameters for out LogisticRegression model
log_reg_grid={"C":np.logspace(-4,4,30),
             "solver":["liblinear"]}
#set up grid hyperparameter search for LogisticRegression
gs_log_reg=GridSearchCV(LogisticRegression(),
                        param_grid=log_reg_grid,
                        cv=5,
                        verbose=True)
gs_log_reg.fit(x_test,y_test)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


In [48]:
gs_log_reg.score(x_test,y_test)

0.9344262295081968

### Evaluating our model tuned machin learning classifier,beyond accuracy
    1.ROC curve and AUC curve
    2.Confusion matrix
    3.Classification report
    4.Precision
    5.Recall
    6.F1-Score

....and it would be great if cross validation was used where possible

To make comparision and evaluate our trained model first we need to make predictions`

In [50]:
#Make predicitons with tuned model 
y_preds=gs_log_reg.predict(x_test)

In [51]:
y_preds

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [52]:
y_test


179    0
228    0
111    1
246    0
60     1
      ..
249    0
104    1
300    0
193    0
184    0
Name: target, Length: 61, dtype: int64

In [55]:
# confusion matrics
print(confusion_matrix(y_test,y_preds))

[[27  2]
 [ 2 30]]


Now we have got a ROC curve , an AUC metric and confusion matrix, lets get a classification report as well as Cross validated precision,recall and F1-score

In [56]:
#Classification report
print(classification_report(y_test,y_preds))


              precision    recall  f1-score   support

           0       0.93      0.93      0.93        29
           1       0.94      0.94      0.94        32

    accuracy                           0.93        61
   macro avg       0.93      0.93      0.93        61
weighted avg       0.93      0.93      0.93        61



### Calculate evaluation metrics using cross validtion

We are going to calculate precision , recall, F1-score of our model using cross validation and to do so we will using cross_val_score()

In [57]:
#check best hyperparameter
gs_log_reg.best_params_

{'C': 1.3738237958832638, 'solver': 'liblinear'}

In [58]:
#create a new classifier with best parameters 
clf=LogisticRegression(C=1.3738237958832638,
                      solver="liblinear")
clf

In [59]:
#cross validation accuracy 
cv_acc=cross_val_score(clf,
                      x,
                      y,
                      cv=5,
                      scoring="accuracy")
cv_acc

array([0.81967213, 0.86885246, 0.85245902, 0.85      , 0.71666667])

In [60]:
cv_acc=np.mean(cv_acc)
cv_acc

0.8215300546448088

In [61]:
#cross validated Precision
cv_precision=cross_val_score(clf,
                            x,
                            y,
                            cv=5,
                            scoring="precision")
cv_precision=np.mean(cv_precision)
cv_precision

0.817900063251107

In [62]:
#cross validated recall
cv_recall=cross_val_score(clf,
                            x,
                            y,
                            cv=5,
                            scoring="recall")
cv_recall=np.mean(cv_recall)
cv_recall

0.8727272727272727

In [63]:
#cross validated recall
cv_f1=cross_val_score(clf,
                      x,
                      y,
                      cv=5,
                      scoring="f1")
cv_f1=np.mean(cv_f1)
cv_f1

0.8431741323998502

### Feature Importance

Feature Importance is an as asking ,"whcih feature contributed most to the outcomes of the model and how did they coontributed"

Finding Feature Importance is different for each machine learning model(MODEL NAME)

Lets find the feature Importance for our logistic Regression model.....

In [64]:
#Fit an instance of the logisticRegression 
gs_log_reg.best_params_

{'C': 1.3738237958832638, 'solver': 'liblinear'}

In [65]:
clf.fit(x_train,y_train)
#check coef-
clf.coef_

array([[ 0.00633081, -1.42862514,  0.80456196, -0.01289764, -0.00249104,
         0.14858522,  0.54777956,  0.02597541, -0.93224226, -0.61770483,
         0.68963642, -0.7866042 , -0.87760738]])

In [66]:
#match coef's of feature to column
feature_dict=dict(zip(df.columns,list(clf.coef_[0])))
feature_dict

{'age': 0.0063308117701570375,
 'sex': -1.4286251374088468,
 'cp': 0.8045619595767074,
 'trestbps': -0.012897636496037002,
 'chol': -0.0024910362385091632,
 'fbs': 0.14858521913979394,
 'restecg': 0.5477795639285691,
 'thalach': 0.025975407384727145,
 'exang': -0.932242256481015,
 'oldpeak': -0.6177048333294851,
 'slope': 0.689636422678003,
 'ca': -0.7866041950669852,
 'thal': -0.8776073783675938}

In [67]:
#visualize Feature Importance
feature_df=pd.DataFrame(feature_dict,index=[0])
feature_df.T.plot.bar(title="Feature Importance",
                     legend=False);

In [68]:
pd.crosstab(df["sex"],df["target"])

target,0,1
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
0,24,72
1,114,93


In [69]:
pd.crosstab(df["slope"],df["target"])

target,0,1
slope,Unnamed: 1_level_1,Unnamed: 2_level_1
0,12,9
1,91,49
2,35,107


# 6.Experimentation

IF you have not hit your evaluation metrics yet....ask yourself
    
    * 1.could you collect more data?
    
    * 2.could you try better model?like CatBoost or XGboost?
    
    * 3.could you improve the current models?(beyond what we have done so far)
    
    * 4.If your model is good enough (you have hit your evaluation metric) how would you export it?