In [None]:
import os,warnings;warnings.filterwarnings("ignore")
import numpy as np;import pandas as pd;import matplotlib.pyplot as plt
import seaborn as sns;sns.set(style='whitegrid')
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1. Introduction
***

## 1.1. Problem End Goal
***

The goal of the problem is to predict the target variable, called `condition`. It has only two unique values, so let's treat this as a `classification problem`.

## 1.2. What is covered?
***
- A `Classification` problem, using the `UCI Dataset`. 
- A simple `Gaussian Process Regressor` turned `classifier` is used for our model; it is an `sklearn` compatible class that incorporates a simple ensemble weighting for the `posterior` prediction.
- Influence of uneven weight allocation for `categorical` features. The dataset already contains a predefined conversion of `categorical` features into a `numerical` subset. A poorly allocated weighting for categorical features can negatively affect the model performance.

## 1.3. Problem Feature Description
***

**Feature List:**
`age`
`sex`
`cp`
`trestbps`
`chol`
`fbs`
`restecg`
`thalach`
`exang`
`oldpeak`
`slope`
`ca`
`thal`
`condition` <br>
**Feature Types:** All features are `numerical features`; however, some of them are converted `categorical features`. The EDA below shows which is which.

<div class="alert alert-block alert-info">
<b>1. age |</b> Number of years a person has lived <br>
<b>2. sex |</b> Gender of patient (Male:1/Female:0)  <br>
<b>3. cp | </b> Chest Pain type (4 values) <br>
<b>4. trestbps |</b> Resting Blood Pressure <br>
<b>5. chol |</b> Serum cholesterol in mg/dl <br>
<b>6. fbs |</b> Fasting Blood Sugar > 120 mg/dl <br>
<b>7. restecg |</b> Resting Electrocardiographic (ECG) results (values 0,1,2) <br>
<b>8. thalach |</b> Maximum Heart Rate Achieved <br>
<b>9. exang |</b> Exercise Induced Angina <br>
<b>10. oldpeak |</b> oldpeak = ST depression induced by exercise relative to rest <br>
<b>11. slope |</b> the slope of the peak exercise ST segment <br>
<b>12. ca |</b> Number of major vessels (0-3) colored by fluoroscopy <br>
<b>13. thal |</b> Thallium stress test results: 3 = normal; 6 = fixed defect; 7 = reversible defect 
<br><br>
<b>Target Variable</b>, from [original dataset](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)<br>
condition: diagnosis of heart disease (angiographic disease status)<br>
<b>Value 0:</b> < 50% diameter narrowing (negative for disease) <br>
<b>Value 1:</b> > 50% diameter narrowing (positive for disease)
</div><br>

## 1.4. Indepth Feature Description Space
***

Some interesting excerpts are placed in this section.

**ECG Related Features** : `restecg`,`oldpeak`,`slope`
<div class="alert alert-block alert-info">
    
**restecg**
> Resting electrocardiography (ECG) is a non-invasive test that can detect abnormalities including arrhythmias, evidence of coronary heart disease, left ventricular hypertrophy and bundle branch blocks. [Reference](https://www.ncbi.nlm.nih.gov/books/NBK367910/)

`0: normal`,`1: having ST-T wave abnormality`,`2:showing probable or definite left ventricular hypertrophy by Estes' criteria`</div><br>

**Blood Related Features** : `trestbps`,`thalach`,`fbs`,`chol`
<div class="alert alert-block alert-info">
    
**trestbps: Resting Blood Pressure**
> Stress on the blood vessels makes people with hypertension more prone to heart disease, peripheral vascular disease, heart attack, stroke, kidney disease and aneurysms. Correspondingly, chronic conditions such as diabetes, kidney disease, sleep apnea and high cholesterol increase the risk for developing high blood pressure. [Reference](https://www.rush.edu/health-wellness/discover-health/6-high-blood-pressure-facts)

**thalach: Maximum Heart Rate**
> It has been shown that an increase in heart rate by 10 beats per minute was associated with an increase in the risk of cardiac death by at least 20%, and this increase in the risk is similar to the one observed with an increase in systolic blood pressure by 10 mm Hg. [Reference](https://pubmed.ncbi.nlm.nih.gov/19615487/#:~:text=It%20has%20been%20shown%20that,pressure%20by%2010%20mm%20Hg.)

**fbs: Fasting Blood Sugar**
> A test that measures blood sugar levels. Elevated levels are associated with diabetes and insulin resistance, in which the body cannot properly handle sugar (e.g. obesity). [Reference](https://my.clevelandclinic.org/health/diagnostics/16790-blood-sugar-tests)

**chol: serum cholesterol**
> The conventional view is that having high LDL cholesterol levels increases your risk of dying of cardiovascular diseases, such as heart disease. [Reference](https://www.nhs.uk/news/heart-and-lungs/study-says-theres-no-link-between-cholesterol-and-heart-disease/)
</div><br>


**Pain & Defect Related Features** : `cp`,`exang`,`thal`,`ca`
<div class="alert alert-block alert-info">

**exang:  Exercise Induced Angina**
> Angina is chest pain caused by reduced blood flow to the heart muscles. It's not usually life threatening, but it's a warning sign that you could be at risk of a heart attack or stroke. [Reference](https://www.nhs.uk/conditions/angina/)

**ca: number of major vessels (0-3) colored by fluoroscopy**
> However, number of major vessels colored by fluoroscopy had a medium effect on the CAD diagnosis that could be due to the small sample size of the study population. It might be related to the fact that the sensitivity of fluoroscopy could be as low as 35% in some cases,[86] and the system learned it from the training set.
[Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4468223/)
</div><br>

## 1.5. Model Selection
***

- For this problem, we need a model that can classify. It's possible to use a `regression` model and add a few lines of code that snap the `regressor` prediction to the nearest class. The `regressor` format is used so that `ensemble` model weighting, which is often said to improve predictions, can be applied to the continuous output. 
- Let's use the `Gaussian Process` Regressor, introduced in a [previous notebook](https://www.kaggle.com/andreikozlov/uci-airfoil-noise-prediction), & add a simple `classifier` that finds the nearest unique class to the `posterior mean` prediction.
- The model doesn't incorporate `probability` prediction, so `ROC` & `PR` curves, which are generally useful, aren't looked at. 
- Instead, the model is used with `GridSearchCV` & manual selection of the model parameters (`hyperparameters`) to prevent severe overfitting.
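
The "nearest class" step mentioned above can be sketched in a few lines. This is a minimal illustration, not the notebook's `GPC` class itself; the helper name `nearest_class` and the sample values are assumptions made for the example.

```python
import numpy as np

def nearest_class(mu, class_labels):
    """Map each continuous prediction to the closest known class label."""
    mu = np.asarray(mu, dtype=float).reshape(-1, 1)           # column of predictions, shape (n, 1)
    labels = np.asarray(class_labels)                          # known classes, shape (k,)
    idx = np.abs(mu - labels.reshape(1, -1)).argmin(axis=1)    # index of the closest label per row
    return labels[idx]

# Hypothetical regressor outputs for a 0/1 target
mu = np.array([-0.2, 0.49, 0.51, 1.3])
labels = nearest_class(mu, [0, 1])
print(labels)
```

For a binary 0/1 target this is equivalent to thresholding the regressor output at 0.5, but the same helper also works for more than two classes.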

# 2. Exploratory Data Analysis (EDA)
***

The dataset contains quite a lot of features to choose from for building our model(s), but we first need to investigate the dataset. The features can be roughly divided into `general patient`,`ECG`,`blood`,`pain` related features.

## 2.1. Initial Dataset Impression
***

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-cleveland-uci/heart_cleveland_upload.csv')
df.info()

In [None]:
df.head()

In [None]:
display(df.describe())

- `info()` suggests we have a full dataset w/ no missing values
- `head()` suggests the `categorical` features described above are already converted for us

## 2.2. Correlation Matrix
***
We have a lot of features, so let's investigate their `linear correlation`. They are all formally numerical, since all `categorical` features are already converted for us.

In [None]:
''' Plot a Shifted Correlation Matrix '''
# Diagonal correlation is always unity & less relevant, shifted variant shows only relevant cases
def corrMat(df,id=False):
    
    corr_mat = df.corr().round(2)
    f, ax = plt.subplots(figsize=(12,7))
    mask = np.triu(np.ones_like(corr_mat, dtype=bool))  # np.bool was removed in NumPy 1.24
    mask = mask[1:,:-1]
    corr = corr_mat.iloc[1:,:-1].copy()
    sns.heatmap(corr,mask=mask,vmin=-0.3,vmax=0.3,center=0, 
                cmap='plasma',square=False,lw=2,annot=True,cbar=False)
#     bottom, top = ax.get_ylim() 
#     ax.set_ylim(bottom + 0.5, top - 0.5) 
    ax.set_title('Shifted Linear Correlation Matrix')
    
corrMat(df)

`condition` variable has a relatively broad range of correlated features:  <br>
***
- Amongst the higher positively correlated features: `thal` (0.52), `ca` (0.46), `oldpeak` (0.42), `exang` (0.42), `cp` (0.41) <br>
- Only one feature is negatively correlated to the target variable: `thalach` (-0.42)
- The only feature with little to no linear correlation to the target variable: `fbs` (0.0)

In [None]:
''' CountPlot Histograms '''

plt4 = ['tab:blue','tab:orange']
def plot1count(x,xlabel,palt):
    
    plt.figure(figsize=(20,2))
    sns.countplot(x=x,hue='condition', data=df, palette=palt)
    plt.legend(["<50% diameter narrowing", ">50% diameter narrowing"],loc='upper right')
    plt.xlabel(xlabel)
    plt.ylabel('Frequency')
    plt.show()
    
def plot1count_ordered(x,xlabel,order,palt):
    
    plt.figure(figsize=(20,2))
    sns.countplot(x=x,hue='condition',data=df,order=order,palette=palt)
    plt.legend(["<50% diameter narrowing", ">50% diameter narrowing"],loc='upper right')
    plt.xlabel(xlabel)
    plt.ylabel('Frequency')
    plt.show()

def plot2count(x1,x2,xlabel1,xlabel2,colour,rat,ind1=None,ind2=None):
    
    # colour, ratio, index_sort

    fig,ax = plt.subplots(1,2,figsize=(20,3),gridspec_kw={'width_ratios':rat})
    # Left panel: countplot of x1, split by condition
    sns.countplot(x=x1,hue='condition',data=df,order=ind1,palette=colour,ax=ax[0])
    ax[0].legend(["<50% diameter narrowing", ">50% diameter narrowing"],loc='upper right')
    ax[0].set_xlabel(xlabel1)
    ax[0].set_ylabel('Frequency')

    # Right panel: countplot of x2, split by condition
    sns.countplot(x=x2,hue='condition', data=df,order=ind2,palette=colour,ax=ax[1])
    ax[1].legend(["<50% diameter narrowing", ">50% diameter narrowing"],loc='best')
    ax[1].set_xlabel(xlabel2)
    ax[1].set_ylabel('Frequency')
    plt.show()
    
''' Plot n Countplots side by side '''
def nplot2count(lst_name,lst_label,colour,n_plots):
    
    fig,ax = plt.subplots(1,n_plots,figsize=(20,3))
    # one countplot per (column name, axis label) pair
    for ii,(id1,id2) in enumerate(zip(lst_name[:n_plots],lst_label[:n_plots])):
        sns.countplot(x=id1,hue='condition',data=df,palette=colour,ax=ax[ii])
        ax[ii].legend(["<50% diameter narrowing", ">50% diameter narrowing"],loc='upper right')
        ax[ii].set_xlabel(id2)
        ax[ii].set_ylabel('Frequency')

## 2.3. Feature Bivariate Histograms
***

Bivariate histograms are split into several groups of related features: `General patient`,`ECG`,`Blood`,`Pain`. Each plot compares the two `condition` classes via `hue`.

### 2.3.1. `General Features` & `Pain Related Features`
***

In [None]:
plot2count('age','sex','Age of Patient','Gender of Patient',plt4,[2,1])
lst1 = ['cp','exang','thal','ca']
lst2 = ['Chest Pain Type','Exercise Induced Angina','Thallium Stress Result','Fluoroscopy Vessels']
nplot2count(lst1,lst2,plt4,4)

**Age & Sex; Linear Correlation (0.23,0.28)** <br>
Age group [29,54]: higher frequency of `<50%` cases. <br>
Age group [55,63]: distinctly larger proportion of `>50%` cases.<br>
Aside from a couple of patients in the age group 71 to 76, `>50%` is much more populated for the higher age group, identifying them as the higher risk patients.<br>
`Sex` distribution suggests `male` patients are more likely to be associated with `>50%`. <br> 
`Female` patients also had a much more even distribution of `>50%` & `<50%` cases, compared to `male` patients.

**Chest Pain Type; Linear Correlation (-0.41)**<br>
`>50%` patients actually are associated with no chest pain symptoms as is seen in the data.

**Exercise Induced Angina; Linear Correlation (0.42)** <br>
Higher values of `exang` are associated with higher values of `condition` (`>50%`) <br>

**Major Coloured Vessels; Linear Correlation (0.46)** <br>
Higher values of coloured `major vessels` are associated with target variable `>50%` <br>

### 2.3.2. `ECG Related Features`
***

In [None]:
lst_ecg = ['oldpeak','restecg','slope','condition']
plot1count('oldpeak','oldpeak: ST Depression Relative to Rest',plt4)
plot2count('restecg','slope','restecg: Resting electrocardiography (ECG)','slope: Slope of Peak Exercise ST Segment',plt4,[1,1])

**oldpeak**<br>
Feature `oldpeak` shows a distinct association of `>50%` with higher `oldpeak` values, above 1.0 <br>
Patients w/ `<50%` tend to have an `oldpeak` of 0.0, with some slight increase possible, but very rarely above 2.0

**Resting Electrocardiography (ECG)** <br>
Most `<50%` patients have a normal `ECG`, however a large portion also show probable/definite left ventricular hypertrophy <br>
A large portion of `>50%` patients also show a normal `ECG`, but more are associated with a probable/definite left ventricular hypertrophy result.

**Slope** <br>
Upsloping tends to be associated with `<50%` patients <br>
Flat slopes are more associated with `>50%` patients <br>
Very little variation exists for downsloping

### 2.3.3. `Blood Related Features`
***

In [None]:
lst_blood = ['trestbps','thalach','fbs','chol','condition']
plot1count('trestbps','trestbps: Resting Blood Pressure (mmHg)',plt4)
plot1count_ordered('thalach','thalach: Maximum Heart Rate',df['thalach'].value_counts().iloc[:30].index,plt4)
plot2count('fbs','chol','Fasting Blood Sugar','Serum Cholestoral',plt4,[2,10],None,df['chol'].value_counts().iloc[:40].index)

**Resting Blood Pressure; Linear Correlation (0.15)**<br>
Distinctive peaks are visible in data, 110,120,130,140,150,160 mmHg; likely associated with common values. <br>
Lower pressure in the range 94-108 mmHg tends to be associated with `<50%` patients <br>
A rather spread out relation for `>50%` is noted, but tends to slightly favour higher values of `trestbps`, esp. 160+ mmHg

**Maximum Heart Rate, thalach; Linear Correlation (-0.42)**<br>
`>50%` cases are thus more associated with lower maximum heart rate values<br>
The histogram data shows the higher frequency cases; we note a large number of peaks for `>50%` which are quite low (125,132,144,150), `<50%` cases are more associated with higher values, 150+, indicative of the `correlation` value.

**Fasting Blood Sugar, fbs; Linear Correlation (0)**<br>
Most patients have an `fbs` below 120 mg/dl, and there seems to be no relation to `condition` for values above 120 mg/dl.

**Serum Cholesterol, chol; Linear Correlation (0.08)**<br>
A wide range of `chol` values are measured, ranging from 126 to 564.<br>
The most common values are almost all higher than 200, which is associated with an elevated level.

**Numerical/Categorical Division**
***

Having reviewed the features via `bivariate` histograms, we can define a `numerical/categorical` feature split:

<div class="alert alert-block alert-info">
    
- **Numerical Features**: age,oldpeak,trestbps,thalach,chol <br>

- **Categorical Features**: sex,restecg,slope,fbs,cp,exang,thal,ca
</div>
<br>
    
- As we have a few `categorical` features that have been converted into `numerical` for us, we should question whether each of the categories is weighted correctly, as they are simply ordered from 0. 
- Higher values may indicate a larger weight, which may not be correct when features are used in the model. So let's opt for a `OneHotEncoding` approach to split these `categorical features` into separate features.
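
To make the weighting concern concrete, the sketch below compares pairwise distances between the levels of a hypothetical four-level categorical feature under integer codes and under one-hot encoding. The feature and its codes are invented for illustration; the point is only that a distance-based kernel, such as the `rbf` covariance function used later, sees integer codes as unevenly spaced categories.

```python
import numpy as np

# Hypothetical 4-level categorical feature stored as integer codes 0..3
codes = np.array([0, 1, 2, 3])

# Integer encoding: the distance between two categories depends on the arbitrary code order
dist_int = np.abs(codes[:, None] - codes[None, :])

# One-hot encoding: every pair of distinct categories is equally far apart
onehot = np.eye(len(codes))[codes]
dist_ohe = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=2)

print(dist_int)
print(dist_ohe.round(3))
```

With integer codes, category 0 is three times further from category 3 than from category 1, whereas the one-hot distances are identical for all distinct pairs.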

In [None]:
lst_ohe_feat = ['sex','restecg','slope','fbs','cp','exang','thal','ca']
lst_ohe_out = []
for i in lst_ohe_feat:
    tdf = pd.get_dummies(df[i],prefix=i)  # prefix each dummy column with the feature name
    lst_ohe_out.append(tdf)
    
lst_ohe_out.append(df['condition'])
df_ohe = pd.concat(lst_ohe_out,axis=1) # One Hot Encoding Features df

## 2.4. Bivariate PairGrids
***
Pairgrids often reveal something interesting in pairwise feature relations, plotting a separate `hue` for both `condition` subsets.

In [None]:
''' Draw a Bivariate Seaborn PairGrid w/ KDE density '''
def snsPairGrid(df):

    ''' Plots a Seaborn Pairgrid w/ KDE & scatter plot of df features'''
    g = sns.PairGrid(df,diag_sharey=False,hue='condition')
    g.fig.set_size_inches(13,13)
    g.map_upper(sns.kdeplot,n_levels=5)
    g.map_diag(sns.kdeplot, lw=2)
    g.map_lower(sns.scatterplot,s=30,edgecolor="k",linewidth=1,alpha=0.6)
    g.add_legend()
    plt.tight_layout()

In [None]:
# We actually have only 5 continuous numerical features, the rest are categorical numbers
numvars_targ = ['age','trestbps','chol','thalach','oldpeak','condition']
snsPairGrid(df[numvars_targ])

`Condition` : `0:<50%`, `1:>50%`
***
- A lot of overlap in variables in the features, most notably `chol` & `trestbps` look quite similar in distribution
- `oldpeak`,`thalach`,`age` are features with most variation in target variable `condition`.

`oldpeak` is an interesting feature where a large variation in `condition` exists:
- a lot of `>50%` cases have an elevated oldpeak
- age group isn't really a factor in elevated `oldpeak` values
- `chol` levels don't seem too out of the ordinary for these elevated cases
- `thalach` levels tend to be lower, two visible centres emerge on the KDE relation, indicating a linear relation; (

# 3. Model Building
***
- We have quite a lot of features we can look into when building a model that predicts `condition`
- EDA revealed our features can be split into a few subgroups; let's investigate whether it's possible to build models based on `ECG`,`Blood`,`Patient Features` alone, as well as a mixture of them. The original dataset `categorical` features are used.

- EDA also revealed our `numerical` features do contain converted `categorical` features, implying they have been allocated a specific value starting from zero. These values can be misinterpreted by the model, thus `OHE` will be employed. If `OHE` shows an improvement, it's likely the `standard` feature allocation for categorical values is negatively affecting the model performance.

- As previously outlined, `GP` is significantly prone to overfitting; it's very easy to get very high scores on training data but very poor test data accuracy. So a decent model should have similar accuracy on all tested data.
- All models are built on the assumption `sigma_n=0.01`

## 3.1. Gaussian Process (GP) Classifier, GPC()

The <b>GPC()</b> class contains a very simplistic classification addition to the <b>Regressor</b> model, based on the closest distance to each class. <br>
The GP model has the ability to accurately adapt to the data it is provided in a high dimensional space. How the model behaves depends on its <b>hyperparameter</b> selection. <br><br>
Each <b>hyperparameter</b> has its own role in how it changes the model, with its core being the <b>covariance matrix</b>, which defines all of the instance relation weights in a neat matrix format. <b>Multiple</b> covariance matrices have to be constructed to make a prediction, including matrix inversions, yet GP is the cheaper variant of all models associated with it. <br><br>
<b>GP</b> has several <b>hyperparameters</b> which must be set to train the model; three are set here: `theta`,`sigma`,`sigma_n`.
***
- Two `hyperparameters` are associated with the `covariance function` (`kernel` if you prefer): `theta` & `sigma`. This function defines all weights in the `covariance matrix` and is used in both the `variance` & `covariance` parts of the matrix. Covariance functions can be set in whatever combination suits your problem.
- The last, `sigma_n`, is a hyperparameter associated with the diagonal term in the `covariance matrix`, influencing the `variance` component only. It expresses how much the training nodes are trusted (noisy/noiseless assumption).
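
For reference, the roles of the three hyperparameters can be written out explicitly. The formulas below assume the radial basis (`rbf`) covariance function as implemented in `covfn` in the code cell of this section; the symbols $X$ (training features), $X_*$ (test features), $y$ (training targets), $K$, $K_*$ and $\mu_*$ are notation introduced here for illustration.

$$k(x, x') = \sigma^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2\theta^2}\right)$$

$$K = k(X, X) + \sigma_n^2 I$$

$$\mu_* = K_*^{\top} K^{-1} y, \qquad K_* = k(X, X_*)$$

Here $\theta$ acts as a length scale, $\sigma$ scales the overall amplitude, and $\sigma_n$ only enters through the diagonal term of $K$.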
    
**Model Instantiation Options, activates `__init__` content:**
***

**Hyperparameters** <br>
- `self.theta` is the `hyperparameter` associated with the `covariance function`; similarly for the other two. 
- `__init__` sets the parameters to default values if they are not given, i.e. `GPC()`; the defaults in the code below are `theta=10`,`sigma=10`,`sigma_n=1.0`. 
- You can set them manually, `GPC(theta=1,sigma=1,sigma_n=0.01,opt=False)`, but `opt=False` must be present to prevent the `hyperparameters` from being overwritten. <br>

**Other options**
- `self.opt` is the previously mentioned activator for `objective function` optimisation.
- `GPC.kernel` is a common class variable of GPC, defining the type of `covariance function` used.
- `self.mu_in` is an import for previously calculated posterior mean predictions 
- `self.se_alp` is the ensemble coefficient (current prediction multiplier)
- `self.se_bet` is the ensemble coefficient (imported prediction multiplier)
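
The two ensemble coefficients amount to a weighted sum of two continuous predictions, taken before the result is snapped to a class. The sketch below is a simplified stand-alone version of that idea; the function name `blend_and_classify` and the sample predictions are hypothetical.

```python
import numpy as np

def blend_and_classify(mu_current, mu_imported, class_labels, se_alp=0.5, se_bet=0.5):
    """Weighted sum of two continuous predictions, snapped to the nearest class."""
    mu = se_alp * np.asarray(mu_current) + se_bet * np.asarray(mu_imported)
    labels = np.asarray(class_labels)
    # for every blended value, pick the label with the smallest absolute distance
    return labels[np.abs(mu[:, None] - labels[None, :]).argmin(axis=1)]

mu_a = np.array([0.9, 0.2, 0.6])   # hypothetical posterior means from the current model
mu_b = np.array([0.7, 0.4, 0.1])   # hypothetical posterior means imported from another model
blended = blend_and_classify(mu_a, mu_b, [0, 1])
print(blended)
```

In the third sample the two models disagree (0.6 against 0.1); with equal weights the blended value is 0.35, so the imported prediction tips the result to class 0.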

**Training the GPC() model**, `.fit(X,y)`
***
- Setting `hyperparameters` & calculating the training <b>covariance matrix</b>, whose pseudo-inverse is stored as `self.IKmat`
- Specific to the `classifier`, all unique target values are defined
- Hyperparameters can either be set in the manner outlined above or tuned based on a specific `objective function`. 
- Various `objective functions` exist; the full `likelihood` & a simplified variant are included. 
- Scipy's `optimize.minimize` function is used together with the `L-BFGS-B` method to find the minimum.
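
As a stand-alone illustration of the tuning step, the sketch below minimises the negative log marginal likelihood of a zero-mean GP with an RBF kernel on synthetic data. It assumes SciPy's `cho_factor`/`cho_solve` for the linear algebra rather than the `lstsq` calls used in the class, and the data, the noise level `0.1` and the helper names are made up for the example.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf(X0, X1, theta, sigma):
    """RBF covariance: sigma^2 * exp(-||x - x'||^2 / (2 * theta^2))."""
    r = np.sum(X0**2, 1)[:, None] + np.sum(X1**2, 1)[None, :] - 2.0 * X0 @ X1.T
    return sigma**2 * np.exp(-0.5 * np.maximum(r, 0.0) / theta**2)

def neg_log_likelihood(hypers, X, y, sigma_n):
    """Negative log marginal likelihood of a zero-mean GP."""
    theta, sigma = hypers
    K = rbf(X, X, theta, sigma) + sigma_n**2 * np.eye(len(X))
    c, low = cho_factor(K, lower=True)           # K = L L^T
    alpha = cho_solve((c, low), y)               # K^-1 y
    return (np.sum(np.log(np.diag(c)))           # 0.5 * log|K|
            + 0.5 * y @ alpha                    # 0.5 * y^T K^-1 y
            + 0.5 * len(X) * np.log(2.0 * np.pi))

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

res = minimize(neg_log_likelihood, x0=[1.0, 1.0], args=(X, y, 0.1),
               bounds=((1e-5, None), (1e-5, None)), method='L-BFGS-B')
theta_opt, sigma_opt = res.x
print(theta_opt, sigma_opt, res.fun)
```

The lower bounds keep both hyperparameters positive, mirroring the bounds passed to `minimize` in the class.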

**Making a prediction using the GPC() model**, `.predict(X)`
***
- The `Covariance Matrix` for Training & Test Feature Matrices needs to be calculated.
- The main model prediction output is commonly referred to as the `posterior mean`.
- Specific to the `classifier`, the nearest class to each prediction is found.
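
The prediction steps can be traced on a tiny one-dimensional example. The sketch below is an illustration under assumed toy data and fixed hyperparameters (`theta=1`, `sigma=1`, `sigma_n=0.01`); it uses `np.linalg.solve` instead of the pseudo-inverse stored by the class.

```python
import numpy as np

def rbf(X0, X1, theta=1.0, sigma=1.0):
    """RBF covariance between the rows of X0 and the rows of X1."""
    r = np.sum(X0**2, 1)[:, None] + np.sum(X1**2, 1)[None, :] - 2.0 * X0 @ X1.T
    return sigma**2 * np.exp(-0.5 * np.maximum(r, 0.0) / theta**2)

# Toy data, for illustration only
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5], [2.5]])
sigma_n = 0.01

K = rbf(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))   # train/train covariance
K_s = rbf(X_train, X_test)                                       # train/test covariance
mu_s = K_s.T @ np.linalg.solve(K, y_train)                       # posterior mean

# classifier step: nearest class label to each posterior mean value
labels = np.unique(y_train)
pred = labels[np.abs(mu_s[:, None] - labels[None, :]).argmin(axis=1)]
print(mu_s, pred)
```

Note that the posterior mean is not confined to the interval between the class labels; the nearest-class step is what turns it into a valid label.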

In [None]:
from sklearn.base import BaseEstimator,ClassifierMixin
from numpy.linalg import cholesky, det, lstsq, inv, eigvalsh, pinv
from scipy.optimize import minimize
pi = 4.0*np.arctan(1.0)

# Usage similar to any sklearn model
class GPC(BaseEstimator,ClassifierMixin):

    ''' Class Instantiation Related Variables '''
    # With just the one class specific GPC.kernel
    def __init__(self,kernel='rbf',theta=10.0,sigma=10.0,sigma_n=1.0,opt=True,mu_in=None,se_alp=0.5,se_bet=0.5):
        self.theta = theta            # Hyperparameter associated with covariance function
        self.sigma = sigma            #                       ''
        self.sigma_n = sigma_n        # Hyperparameter associated with cov.mat's diagonal component
        self.opt = opt                # Update hyperparameters with objective function optimisation
        GPC.kernel = kernel           # Selection of Covariance Function, class specific instantiation
        self.mu_in = mu_in            # option to import model prediction
        self.se_alp = se_alp          # ensemble coefficient (current prediction multiplier)
        self.se_bet = se_bet          # ensemble coefficient (imported prediction multiplier)

    ''' local covariance functions '''
    # Covariance Functions represent a form of weight adjustor in the matrix W/K
    # for each of the combinations present in the feature matrix
    @staticmethod
    def covfn(X0,X1,theta=1.0,sigma=1.0):

        ''' Radial Basis Covariance Function '''
        if(GPC.kernel == 'rbf'):  # compare strings by value, not identity
            r = np.sum(X0**2,1).reshape(-1,1) + np.sum(X1**2,1) - 2 * np.dot(X0,X1.T)
            return sigma**2 * np.exp(-0.5/theta**2*r)

        ''' Matern Covariance Class of Functions '''
        if(GPC.kernel == 'matern'):
            lid=2
            r = np.sum(X0**2,1)[:,None] + np.sum(X1**2,1) - 2 * np.dot(X0,X1.T)
            if(lid==1):
                return sigma**2 * np.exp(-r/theta)
            elif(lid==2):
                ratio = r/theta
                v1 = (1.0+np.sqrt(3)*ratio)
                v2 = np.exp(-np.sqrt(3)*ratio)
                return sigma**2*v1*v2
            elif(lid==3):
                ratio = r/theta
                v1 = (1.0+np.sqrt(5)*ratio+(5.0/3.0)*ratio**2)
                v2 = np.exp(-np.sqrt(5)*ratio)
                return sigma**2*v1*v2
        else:
            print('Covariance Function not defined')
    
    ''' Train the GP Classifier Model'''
    def fit(self,X,y):
        
        # Two Parts Associated with base GP Model:
        # - Hyperparameter; theta, sigma, sigma_n selection
        # - Definition of Training Covariance Matrix
        # Both are recalled in Posterior Prediction, predict()
        
        ''' Working w/ numpy matrices'''
        # np.asarray accepts numpy arrays as well as pandas objects
        self.X = np.asarray(X); self.y = np.asarray(y)
        self.ntot,ndim = self.X.shape
  
        ''' Define Class Labels '''
        self.class_labels = np.unique(self.y)

        ''' Optimisation Objective Function '''
        # Optimisation of hyperparameters via the objective function
        def llhobj(X,y,noise):
            
            # Simplified Objective Function
            def llh_dir(hypers):
                K = self.covfn(X,X,theta=hypers[0],sigma=hypers[1]) + noise**2 * np.eye(self.ntot)
                return 0.5 * np.log(det(K)) + \
                    0.5 * y.T.dot(inv(K).dot(y)).ravel()[0] + 0.5 * len(X) * np.log(2*pi)

            # Full Likelihood Equation
            def nll_full(hypers):
                K = self.covfn(X,X,theta=hypers[0],sigma=hypers[1]) + noise**2 * np.eye(self.ntot)
                L = cholesky(K)
                return np.sum(np.log(np.diagonal(L))) + \
                    0.5 * y.T.dot(lstsq(L.T, lstsq(L,y)[0])[0]) + \
                    0.5 * len(X) * np.log(2*pi)

            return nll_full # return one of the two, simplified variant doesn't always work well

        ''' Update hyperparameters based on set objective function '''
        if self.opt:
            # define the objective function
            objfn = llhobj(self.X,self.y,self.sigma_n)
            # search for the optimal hyperparameters based on given relation
            res = minimize(objfn,[1,1],bounds=((1e-5,None),(1e-5, None)),method='L-BFGS-B')
            self.theta,self.sigma = res.x # update the hyperparameters to the optimised values

        ''' Get Training Covariance Matrix, K^-1 '''
        Kmat = self.covfn(self.X,self.X,self.theta,self.sigma) \
                 + self.sigma_n**2 * np.eye(self.ntot) # Covariance Matrix (Train/Train)
        self.IKmat = pinv(Kmat) # Pseudo Matrix Inversion (More Stable)
        return self  # return class & use w/ predict()

    ''' Posterior Prediction;  '''
    # Make a prediction based on what the model has learned (hyperparameter selection & training weights)
    def predict(self,Xm):
        
        # Covariance Matrices x2 required; (Train/Test&Train/Test)
        Xm = np.asarray(Xm) # accept numpy arrays as well as pandas DataFrames
        mtot = Xm.shape[0]  # Number of Test Matrix Instances
        K_s = self.covfn(self.X,Xm,self.theta,self.sigma)  # Covariance Matrix (Train/Test)               
        self.mu_s = K_s.T.dot(self.IKmat).dot(self.y)      # Posterior Mean Prediction of current model
        
        # Ensemble Modified Posterior Prediction
        if self.mu_in is not None: 
            
            lntot = self.mu_in[0].shape[0];lmtot = self.mu_in[1].shape[0]
            if(self.mu_s.shape[0]==lntot): j=0
            else: j=1
            loc_mu_s = self.se_alp * self.mu_s + self.se_bet * self.mu_in[j]
            lc = [self.class_labels[np.abs(self.class_labels - x).argmin()] for x in loc_mu_s]
            return np.array(lc)
        
        # Standard Posterior Prediction 
        else:
            # Find the nearest class label to predicted model value, list
            lc = [self.class_labels[np.abs(self.class_labels - x).argmin()] for x in self.mu_s]
            return np.array(lc)
        
''' Sample Usage for GPC() '''
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier as DC
iris = load_iris();X = iris.data;y = iris.target 
model = DC(strategy="most_frequent");model.fit(X,y);print(f'DC(): {model.score(X,y).round(2)}')
model = GPC();model.fit(X,y);print(f'GPC(): {model.score(X,y)}')

In [None]:
''' Draw a single Heatmap using Seaborn '''
def heatmap1(values,xlabel,ylabel,xticklabels,yticklabels,
             cmap='plasma',vmin=None,vmax=None,fmt="%0.2f",title=None):

    fig, ax = plt.subplots(figsize=(5,5))
    sns.heatmap(values, ax=ax,cmap=cmap,cbar=True)
    
    img = ax.pcolor(values, cmap=cmap, vmin=vmin, vmax=vmax)
    img.update_scalarmappable()
    ax.set_xlabel(xlabel);ax.set_ylabel(ylabel)
    ax.set_xticks(np.arange(len(xticklabels)) + 0.5)
    ax.set_yticks(np.arange(len(yticklabels)) + 0.5)
    ax.set_xticklabels(xticklabels);ax.set_yticklabels(yticklabels)
    ax.set_title(title)
    ax.set_aspect(1)
    
    for p, color, value in zip(img.get_paths(), img.get_facecolors(),img.get_array()):
        x, y = p.vertices[:-2, :].mean(0)
        if np.mean(color[:3]) > 0.5:
            c = 'k'
        else:
            c = 'w'
        ax.text(x, y, fmt % value, color=c, ha="center", va="center")
        
''' Plot Several Seaborn Heatmaps Side by Side '''
def heatmapn(n,values,labels,ticklabels,titles,
              cmap='plasma',vmin=None,vmax=None,fmt="%0.2f"):

    ii=-1
    fig,ax = plt.subplots(1,n,figsize=(15,5))
    for i in range(0,n):
        ii+=1
        tval = values[ii];ttitle = titles[ii]
    
        sns.heatmap(tval,ax=ax[ii],cmap=cmap,cbar=True) 
        img = ax[ii].pcolor(tval, cmap=cmap, vmin=vmin, vmax=vmax)
        img.update_scalarmappable()
        ax[ii].set_xlabel(labels[0]);ax[ii].set_ylabel(labels[1])
        ax[ii].set_xticks(np.arange(len(ticklabels[0])) + 0.5)
        ax[ii].set_yticks(np.arange(len(ticklabels[1])) + 0.5)
        ax[ii].set_xticklabels(ticklabels[0]);ax[ii].set_yticklabels(ticklabels[1])
        ax[ii].set_title(ttitle)
        ax[ii].set_aspect(1)
    
        # color of each matrix content
        for p, color, value in zip(img.get_paths(), img.get_facecolors(),img.get_array()):
            x, y = p.vertices[:-2, :].mean(0)
            if np.mean(color[:3]) > 0.5:
                c = 'k'
            else:
                c = 'w'
            ax[ii].text(x, y, fmt % value, color=c, ha="center", va="center")

## 3.2. ECG Based Model
***

- EDA investigation into ECG features revealed they have a good mix of `correlation` values to the target variable `condition`; `restecg(0.17)`,`oldpeak(0.42)`,`slope(0.33)`.
- Let's investigate if it's viable to implement `condition` prediction based on subset feature data
- Since we have only a few features, we can incorporate `PolynomialFeatures()` of a high order without drastically increasing the computational cost; `StandardScaler()` is also added in the `Poly()` Pipeline.
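
To see why a high polynomial order stays affordable here, the sketch below counts the columns a full polynomial expansion produces for the three ECG features. It uses the closed-form count `comb(n + d, d)`, which corresponds to `PolynomialFeatures(d)` with its default settings (bias column included, no interaction-only restriction); the helper name `n_poly_features` is an assumption for the example.

```python
from math import comb

def n_poly_features(n_features, degree):
    """Number of polynomial terms of total degree <= `degree`, bias column included."""
    return comb(n_features + degree, degree)

# three ECG features: restecg, oldpeak, slope
counts = {d: n_poly_features(3, d) for d in (2, 7)}
print(counts)
```

For comparison, applying a 7th order expansion to all 13 features would give `comb(20, 7) = 77520` columns, which is why the high order is only considered for the small ECG subset.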

### 3.2.1. GPC Model w/ GridSearchCV (manual theta,sigma) Search
***
The assumption is that our data has a relatively low noise level & the training data contain minimal errors, `sigma_n=0.01`

In [None]:
from sklearn.model_selection import GridSearchCV,cross_val_score

lst_theta = [0.01, 0.1, 1, 10, 100, 1000, 5000]
lst_sig = [0.01, 0.1, 1, 10, 100, 1000, 5000]

def modelEval(ldf,lst_theta,lst_sig,feature='condition'):

    # Given a dataframe, split feature/target variable
    X = ldf.copy()
    y = ldf[feature].copy()
    del X[feature]
    
    # define parameters for gridsearch (theta,sigma)
    param_grid = {'theta': lst_theta,'sigma': lst_sig}
    
    # split dataset into 5 segments, fit & predict for each segment
    model = GPC(opt=False)  # manual hyperparameter model
    model.fit(X,y)

    gscv = GridSearchCV(model,param_grid,cv=5) # 5 fold CV
    gscv.fit(X.values,y.values)
    results = pd.DataFrame(gscv.cv_results_) 
    scores = np.array(results.mean_test_score).reshape(len(lst_sig),len(lst_theta)) # rows: sigma, columns: theta
    
    # plot the cross validation mean scores of the 5 fold CV
    heatmap1(scores,xlabel='theta',xticklabels=param_grid['theta'],
                    ylabel='sigma',yticklabels=param_grid['sigma'])
    
ldf1 = df[lst_ecg] # subset of ecg features
modelEval(ldf1,lst_theta,lst_sig)

### 3.2.2. GPC Model w/ GridSearchCV (theta,sigma) Search + PolynomialFeatures()
***
Additional to the previous assumption, effects of `PolynomialFeatures()` of 2nd & 7th order are investigated in a `Pipeline()` w/ `StandardScaler()`.

In [None]:
from sklearn.pipeline import make_pipeline,Pipeline
from sklearn.preprocessing import PolynomialFeatures,StandardScaler

lst_theta = [0.01, 0.1, 1, 10, 100, 1000, 5000]
lst_sig = [0.01, 0.1, 1, 10, 100, 1000, 5000]

# Model Evaluation Function for Polynomial Feature Pipeline
def modelEval2(ldf,lst_theta,lst_sig,feature='condition'):

    # Given a dataframe, split feature/target variable
    y = ldf[feature].copy()
    X = ldf.copy()
    del X[feature]     # remove target variable
    
    tlst = []
    for i in [2,7]:
    
        # pipeline: standardise, expand to polynomial features of order i, fit the GPC
        pipe = Pipeline(steps=[('scaler',StandardScaler()),
                               ('poly',PolynomialFeatures(i)),
                               ('model',GPC(opt=False))])

        # pipeline step parameters are addressed as <step name>__<parameter>
        param_grid = {'model__theta': lst_theta,'model__sigma': lst_sig}

        gscv2 = GridSearchCV(pipe,param_grid,cv=10)
        gscv2.fit(X,y)
        ypred = gscv2.predict(X)
        results2 = pd.DataFrame(gscv2.cv_results_)
        scores2 = np.array(results2.mean_test_score).reshape(7,7)
        tlst.append(scores2)
    
    lst_lab = ['theta','sigma'];lst_tit = ['Poly(2)','Poly(7)']
    lst_tick = [param_grid['model__theta'],param_grid['model__sigma']]
    
    # Plot two Heatmaps side by side for the two Polynomial
    heatmapn(n=2,values=tlst,labels=lst_lab,ticklabels=lst_tick,titles=lst_tit)

modelEval2(ldf1,lst_theta,lst_sig)

### 3.2.3. GPC Model (theta=10,sigma=100) w/ Train_test_split()
***
Having conducted a cross validation, let's determine the score for a 70/30 split using `theta=10`,`sigma=100`

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def modelEval3(ldf,hyp,pred_upd,feature='condition'):

    # Given a dataframe, split feature/target variable
    y = ldf[feature].copy()
    X = ldf.copy()
    del X[feature]     # remove target variable
    
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=73)

    lst_mu = []
    if pred_upd is not None:
        model = GPC(theta=hyp[0],sigma=hyp[1],opt=False,mu_in=pred_upd)
    else:
        model = GPC(theta=hyp[0],sigma=hyp[1],opt=False)

    model.fit(X_train,y_train)
    model.predict(X_train.values);lst_mu.append(model.mu_s)
    model.predict(X_test.values);lst_mu.append(model.mu_s)
    print(f'Training Score: {model.score(X_train.values,y_train.values)}')
    print(f'Test Score: {model.score(X_test.values,y_test.values)}')
    
    if pred_upd is None:
        return lst_mu
            
lst_ldf1 = modelEval3(ldf1,hyp=[10,100],pred_upd=None)
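
`confusion_matrix` is imported in the cell above but never called, and a single accuracy figure hides how errors split between false positives and false negatives. The dependency-free sketch below counts the four outcomes for binary labels; the labels are made up, and `confusion_counts` is a hypothetical helper. It also assumes the scores printed by `GPC.score` are plain accuracies, which the notebook does not confirm here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:                      # t == 1 and p == 0
            fn += 1
    return tn, fp, fn, tp

# Made-up labels, purely for illustration
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tn, fp, fn, tp, accuracy)
```

For a medical target such as `condition`, false negatives are usually the costlier error, so reporting them alongside the score is worth the extra line. With scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same four counts in the same order for binary labels.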

**Summarising: ECG Based Models**
<div class="alert alert-block alert-info">
    
- The ECG features, whilst important, do not seem sufficient on their own to build a reasonable model for predicting the target variable `condition`. Adding more features will likely improve the model
    
- The `ECG` feature model reached a peak mean cross validation score of only 0.72 using 10-fold CV
    
- The addition of polynomial features didn't improve the model
    
- A standard `train_test_split` resulted in very similar training and test scores (0.74,0.72)
</div><br>

## 3.3. Blood Related Feature Model & Simple Ensemble
***
- Features `trestbps`,`thalach`,`fbs`,`chol` are grouped here as blood related features; let's reuse the earlier function `modelEval` to run a basic `GridSearchCV` cross validation on them.
- Let's also try a simple `ensemble` approach: the `posterior mean` exported from the `ECG` model is passed into the new model.
- As with the ECG features, the correlations of these features to the target variable `condition` are spread out, neither very low nor very high, so perhaps we might get a better model.
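
The notebook's `mu_in` mechanism is defined in the `GPC` class earlier and is not reproduced here. As a rough illustration of the idea only, the sketch below blends two posterior means with an assumed equal weight and thresholds the result at 0.5. The function names and the numbers are made up, and the blend only makes sense if both models score the same rows in the same order (in `modelEval3` the fixed `random_state` and `test_size` should give identical splits for dataframes with the same rows).

```python
def blend_posterior_means(mu_a, mu_b, w_a=0.5):
    """Weighted average of two models' posterior means for the same samples."""
    if len(mu_a) != len(mu_b):
        raise ValueError("both models must score the same samples")
    return [w_a * a + (1.0 - w_a) * b for a, b in zip(mu_a, mu_b)]

def to_labels(mu, threshold=0.5):
    """Turn regression-style posterior means into 0/1 class labels."""
    return [1 if m >= threshold else 0 for m in mu]

# Made-up posterior means for four samples
mu_ecg   = [0.20, 0.55, 0.90, 0.45]
mu_blood = [0.10, 0.30, 0.60, 0.70]

blended = blend_posterior_means(mu_ecg, mu_blood)
print(to_labels(mu_ecg), to_labels(mu_blood), to_labels(blended))
```

In this toy case the two models disagree on the second and fourth samples, and the blend sides with whichever model is further from the threshold.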

In [None]:
''' Cross Validation '''
lst_theta = [10,100, 500, 1000, 1500, 2000, 2500]
lst_sig = [0.01,0.1,1.0,10,50,100, 500]

ldf2 = df[lst_blood]
modelEval(ldf2,lst_theta,lst_sig)

In [None]:
''' Train/Test Split '''
lst_ldf2 = modelEval3(ldf2,hyp=[1000,50],pred_upd=None)

In [None]:
''' Ensemble Modification Train/Test Split'''
# lst_ldf1 : ECG based model prediction
modelEval3(ldf2,hyp=[1000,50],pred_upd=lst_ldf1)

**Summarising: Blood Related Feature Model & Simple Ensemble**

<div class="alert alert-block alert-info">

- The standalone Blood Related Feature Model did slightly worse than the `ECG` model created earlier
    
- Optimal values of `theta` lie slightly higher than in the previous model, whilst `sigma` was roughly identical to the previous model.
    
- An ensemble approach, in which the `ECG` model output was fed into the `Blood Related` Model, resulted in the highest model score so far (0.76)

</div>

## 3.4. Patient Related Features + ECG Model
***
`Age` & `Sex` are interesting features which may add value to the model, but are not enough by themselves. The two features are added to the ECG feature set to create a new model.

In [None]:
lst = ['age','sex'] 
ldf3 = df[lst+lst_ecg]
modelEval(ldf3,lst_theta,lst_sig)

- Adding `Age` & `Sex` did not seem to improve the `ECG` model

## 3.5. Feature Type Based Models
***

- Having built a few models on the original feature matrix encoding for `categorical` features (integer codes from 0 onwards), we can see that there tends to be a score ceiling that is difficult to exceed. Even `scaling`, as implemented in the `PolynomialFeatures()` pipeline, has little to no impact.

- We have two sets of feature types: `numerical` and `categorical`. The `categorical` features were one-hot encoded (OHE), creating a new feature for every unique category value. The two models are evaluated individually and an ensemble approach is then attempted.
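
One way to see why the encoding matters for a distance-based kernel: integer codes place some categories further apart than others, whereas one-hot vectors keep every pair of distinct categories equally far apart. The sketch below uses the four `cp` codes as an example, assumed here to be 0 to 3; the helper names are hypothetical.

```python
def one_hot(value, categories):
    """One-hot vector for `value` over the ordered list `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

categories = [0, 1, 2, 3]   # assumed integer codes of a 4-level feature such as `cp`
pairs = [(a, b) for a in categories for b in categories if a < b]

int_d = {p: sq_dist([p[0]], [p[1]]) for p in pairs}
ohe_d = {p: sq_dist(one_hot(p[0], categories), one_hot(p[1], categories)) for p in pairs}

for p in pairs:
    print(p, "integer:", int_d[p], "one-hot:", ohe_d[p])
```

Under integer coding, categories 0 and 3 are nine times further apart (in squared distance) than categories 0 and 1, an ordering the data never stated; after OHE every pair sits at a squared distance of 2.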

In [None]:
''' Numerical Features Model '''
lst_theta = [100, 500, 1000, 1500, 2000, 2500,3000]
lst_sig = [500,1000,1500,2000,2500,3000,3500]

df_num = df[numvars_targ].copy()
modelEval(df_num,lst_theta,lst_sig)

In [None]:
lst_ldf3 = modelEval3(df_num,hyp=[1500,1500],pred_upd=None)

In [None]:
''' Categorical Feature Model '''
lst_theta = [10,100, 500, 1000, 1500, 2000, 2500]
lst_sig = [0.01,0.1,1.0,10,50,100, 500]
modelEval(df_ohe,lst_theta,lst_sig)

In [None]:
''' Ensemble Modification of Categorical Feature Model ''' 
modelEval3(df_ohe,hyp=[100,50],pred_upd=lst_ldf3)

**Summarising: Feature Type Based Models**

<div class="alert alert-block alert-info">

- The numerical-features-only model, which combines a mixture of the feature groups above, performed very similarly to the other tested models, peaking at a mean cross validation score of 0.74
    
- The OHE categorical feature set was more promising, consistently scoring over 0.83, which is the best individual subgroup model so far
    
- The training and test sets perform relatively similarly, even with the ensemble posterior prediction adjustment, which is encouraging
    
- Cross validation identified several `hyperparameter` combinations that give a mean CV score of 0.83; each gives a slightly different outcome on the train/test split, ranging from roughly 0.8 to 0.89 on the test set, which seems to benefit the most from ensembling
    
</div><br>

# 4. Thank You
***
Thank you for reading. If you found any part of the notebook useful, please consider giving it an upvote.