# Bayesian Model Averaging Logistic Regression

In this notebook we will use Bayesian Model Averaging (BMA) to understand a logistic regression problem.  The data record coronary heart disease status (0 = does not have CHD, 1 = has CHD), along with a number of medical predictor variables.

In [None]:
import numpy as np 
import pandas as pd 
import statsmodels.api as sm
from statsmodels.tools import add_constant
from itertools import combinations

Load the data, and check the head.

In [None]:
df = pd.read_csv('/kaggle/input/coronary-heart-disease/CHDdata.csv')
df["famhist"] = (df["famhist"] == "Present")*1 # converts famhist to 0 (no family history) and 1 (has family history)
#df = df.drop(["famhist"], axis=1)
df.head()

In [None]:
X = df.drop(["chd"], axis=1)
y = df["chd"]
# building the model and fitting the data 
log_reg = sm.Logit(y, add_constant(X)).fit()
log_reg.summary()

In [None]:
# Seaborn visualization library
import seaborn as sns
# Create the default pairplot
g = sns.pairplot(df, hue="chd", palette="tab10", markers=["o", "D"])

# Bayesian Model Averaging
Here we define the class that will perform our BMA analysis.

For any model $M_i$ (each model is defined by the set of predictor variables it uses), Bayes' theorem tells us that the posterior probability of $M_i$ is
\begin{equation}
p(M_i|X,y)=\frac{p(X,y|M_i)p(M_i)}{p(X,y)}.
\end{equation}

Using our previous formulas, which approximate the model likelihood $p(X,y|M_i)$ by $e^{-\text{BIC}_i/2}$, this becomes
\begin{equation}
p(M_i|X,y)=\frac{e^{-\text{BIC}_i/2}p(M_i)}{\sum_k e^{-\text{BIC}_k/2}p(M_k)}.
\end{equation}
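
As a small illustration of the formula above, here is a sketch that turns a list of BIC values into approximate posterior model probabilities. The function name `posterior_model_probs` and the example BIC values are hypothetical, and the sketch subtracts the largest log-weight before exponentiating (a log-sum-exp shift) rather than using the high-precision arithmetic of the class defined later in this notebook.

```python
import numpy as np

def posterior_model_probs(bics, priors=None):
    """Approximate p(M_i | X, y) from BIC values and optional model priors."""
    bics = np.asarray(bics, dtype=float)
    if priors is None:
        priors = np.ones_like(bics)  # equal priors by default
    priors = np.asarray(priors, dtype=float)
    # Log of the unnormalized weight e^{-BIC_i/2} * p(M_i).
    log_w = -0.5 * bics + np.log(priors)
    # Shift by the maximum so that exp() does not underflow to zero.
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical BIC values for three candidate models.
example_bics = [510.0, 512.0, 520.0]
probs = posterior_model_probs(example_bics)
print(probs)
```

A difference of 2 in BIC corresponds to a factor of $e$ in posterior probability under equal priors, so the first model here is about 2.7 times as probable as the second.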

So far, we have only used Bayes' theorem to compute a posterior probability for each model.  The 'averaging' part of BMA lets us do more.

The probability for any predictor variable is the sum of the probabilities of all models containing that predictor variable, and the expected value of its coefficient is the sum of its coefficient values over all models containing the variable, each weighted by the probability of that model.  That is,
\begin{equation}
p(X_k) = \sum_{M_i \text{ such that } X_k\in M_i} p(M_i|X,y),
\end{equation}
and
\begin{equation}
E[\beta_k] = \sum_{M_i \text{ such that } X_k\in M_i} p(M_i|X,y)\times \beta_k^{(i)},
\end{equation}
where $\beta_k^{(i)}$ is the coefficient of $X_k$ in model $M_i$.
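
To make these two sums concrete, here is a minimal sketch on a toy set of models. The model probabilities, variable names, and coefficient values below are made up for illustration and are not results from the CHD data; `average_over_models` is a hypothetical helper, not part of the BMA class.

```python
# Hypothetical models: tuple of variables -> (posterior probability, coefficients).
models = {
    ("age",): (0.2, {"age": 0.05}),
    ("age", "tobacco"): (0.5, {"age": 0.04, "tobacco": 0.08}),
    ("tobacco",): (0.3, {"tobacco": 0.10}),
}

def average_over_models(models):
    """Return the inclusion probability and expected coefficient per variable."""
    incl = {}
    coef = {}
    for variables, (prob, coefs) in models.items():
        for v in variables:
            # p(X_k): add the probability of every model containing X_k.
            incl[v] = incl.get(v, 0.0) + prob
            # E[beta_k]: add the coefficient weighted by the model probability.
            coef[v] = coef.get(v, 0.0) + prob * coefs[v]
    return incl, coef

incl, coef = average_over_models(models)
print(incl)
print(coef)
```

In this toy case `age` appears in models with total probability 0.2 + 0.5, and its averaged coefficient is 0.2 × 0.05 + 0.5 × 0.04.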

Here is code for a BMA class that performs the Bayesian Model Averaging.  It is the same as the code from the Bayesian Model Averaging notebook https://www.kaggle.com/billbasener/bayesian-model-averaging-regression-tutorial-pt-2, but with the added capability to do logistic regression by setting the keyword RegType to "Logit".

In [None]:
from mpmath import mp
mp.dps = 50
class BMA:
    
    def __init__(self, y, X, **kwargs):
        # Setup the basic variables.
        self.y = y
        self.X = X
        self.names = list(X.columns)
        self.nRows, self.nCols = np.shape(X)
        self.likelihoods = mp.zeros(self.nCols,1)
        self.likelihoods_all = {}
        self.coefficients_mp = mp.zeros(self.nCols,1)
        self.coefficients = np.zeros(self.nCols)
        self.probabilities = np.zeros(self.nCols)
        # Check the max model size. (Max number of predictor variables to use in a model.)
        # This can be used to reduce the runtime by not doing an exhaustive search.
        if 'MaxVars' in kwargs.keys():
            self.MaxVars = kwargs['MaxVars']
        else:
            self.MaxVars = self.nCols  
        # Prepare the priors if they are provided.
        # The priors are provided for the individual regressor variables.
        # The prior for a model is the product of the priors on the variables in the model.
        if 'Priors' in kwargs.keys():
            if np.size(kwargs['Priors']) == self.nCols:
                self.Priors = kwargs['Priors']
            else:
                print("WARNING: Provided priors error.  Using equal priors instead.")
                print("The priors should be a numpy array of length equal to the number of regressor variables.")
                self.Priors = np.ones(self.nCols)  
        else:
            self.Priors = np.ones(self.nCols)  
        if 'Verbose' in kwargs.keys():
            self.Verbose = kwargs['Verbose'] 
        else:
            self.Verbose = False 
        if 'RegType' in kwargs.keys():
            self.RegType = kwargs['RegType'] 
        else:
            self.RegType = 'LS' 
        
    def fit(self):
        # Perform the Bayesian Model Averaging
        
        # Initialize the sum of the likelihoods for all the models to zero.  
        # This will be the 'normalization' denominator in Bayes Theorem.
        likelighood_sum = 0
        
        # To facilitate iterating through all possible models, we start by iterating through
        # the number of elements in the model.  
        max_likelihood = 0
        for num_elements in range(1,self.MaxVars+1): 
            
            if self.Verbose == True:
                print("Computing BMA for models of size: ", num_elements)
            
            # Make a list of all index sets of models of this size.
            Models_next = list(combinations(list(range(self.nCols)), num_elements)) 
             
            # Occam's window - compute the candidate models to use for the next iteration
            # Models_previous: the set of models from the previous iteration that satisfy (likelihood > max_likelihood/20)
            # Models_next:     the set of candidate models for the next iteration
            # Models_current:  the set of models from Models_next that can be constructed by adding one new variable
            #                    to a model from Models_previous
            if num_elements == 1:
                Models_current = Models_next
                Models_previous = []
            else:
                idx_keep = np.zeros(len(Models_next))
                for M_new,idx in zip(Models_next,range(len(Models_next))):
                    for M_good in Models_previous:
                        if(all(x in M_new for x in M_good)):
                            idx_keep[idx] = 1
                            break
                        else:
                            pass
                Models_current = np.asarray(Models_next)[np.where(idx_keep==1)].tolist()
                Models_previous = []
                        
            
            # Iterate through all possible models of the given size.
            for model_index_set in Models_current:
                
                # Fit the regression (logistic or least squares) for this model. 
                model_X = self.X.iloc[:,list(model_index_set)]
                if self.RegType == 'Logit':
                    model_regr = sm.Logit(self.y, model_X).fit(disp=0)
                else:
                    model_regr = sm.OLS(self.y, model_X).fit()
                
                # Compute the likelihood (times the prior) for the model. 
                model_likelihood = mp.exp(-model_regr.bic/2)*np.prod(self.Priors[list(model_index_set)])
                    
                if (model_likelihood > max_likelihood/20):
                    if self.Verbose == True:
                        print("Model Variables:",model_index_set,"likelihood=",model_likelihood)
                    self.likelihoods_all[str(model_index_set)] = model_likelihood
                    
                    # Add this likelihood to the running tally of likelihoods.
                    likelighood_sum = mp.fadd(likelighood_sum, model_likelihood)

                    # Add this likelihood (times the priors) to the running tally
                    # of likelihoods for each variable in the model.
                    for idx, i in zip(model_index_set, range(num_elements)):
                        self.likelihoods[idx] = mp.fadd(self.likelihoods[idx], model_likelihood, prec=1000)
                        self.coefficients_mp[idx] = mp.fadd(self.coefficients_mp[idx], float(model_regr.params.iloc[i])*model_likelihood, prec=1000)
                    Models_previous.append(model_index_set) # add this model to the list of good models
                    max_likelihood = max(max_likelihood, model_likelihood) # update the max likelihood if this model exceeds it
                else:
                    if self.Verbose == True:
                        print("Model Variables:",model_index_set,"rejected by Occam's window")
                    

        # Divide by the denominator in Bayes' theorem to normalize the probabilities 
        # so that they sum to one.
        self.likelighood_sum = likelighood_sum
        for idx in range(self.nCols):
            self.probabilities[idx] = mp.fdiv(self.likelihoods[idx],likelighood_sum, prec=1000)
            self.coefficients[idx] = mp.fdiv(self.coefficients_mp[idx],likelighood_sum, prec=1000)
        
        # Return the new BMA object as an output.
        return self
    
    def predict(self, data):
        data = np.asarray(data)
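        if self.RegType == 'Logit':
            # Apply the logistic function to the linear predictor.
            try:
                result = 1/(1+np.exp(-1*np.dot(self.coefficients,data)))
            except ValueError:
                # Shape mismatch: data has one row per observation, so transpose it.
                result = 1/(1+np.exp(-1*np.dot(self.coefficients,data.T)))
        else:
            try:
                result = np.dot(self.coefficients,data)
            except ValueError:
                result = np.dot(self.coefficients,data.T)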
        
        return result  
        
    def summary(self):
        # Return the BMA results as a data frame for easy viewing.
        df = pd.DataFrame([self.names, list(self.probabilities), list(self.coefficients)], 
             ["Variable Name", "Probability", "Avg. Coefficient"]).T
        return df  
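
The Occam's window step inside `fit` can be read in isolation: a model of a given size is only fitted if it contains at least one model that survived the previous size. The sketch below shows that filtering on its own, assuming a hypothetical helper `candidate_models` that is not part of the class above; it uses a list comprehension instead of the index array in `fit`, but is meant to select the same candidates.

```python
from itertools import combinations

def candidate_models(n_vars, size, previous_good):
    """Index sets of `size` variables that extend a surviving smaller model."""
    all_models = combinations(range(n_vars), size)
    if size == 1:
        # Every single-variable model is a candidate.
        return list(all_models)
    good = [set(m) for m in previous_good]
    # Keep a model only if some surviving model is a subset of it.
    return [m for m in all_models if any(g.issubset(m) for g in good)]

# Hypothetical case: 4 variables, and only variables 0 and 2 survived at size 1.
print(candidate_models(4, 2, [(0,), (2,)]))
# → [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
```

The pair `(1, 3)` is dropped because neither variable 1 nor variable 3 survived on its own, which is how the window reduces the number of models that need to be fitted.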

Now we pass the input dataframe X (with a constant column added) and the output y to the BMA class, and run our BMA analysis.

In [None]:
result = BMA(y,add_constant(X), RegType = 'Logit', Verbose=True).fit()

In [None]:
result.summary()

In [None]:
result.likelihoods_all

In [None]:
# predict the y-values from training input data
pred_BMA = result.predict(add_constant(X))
pred_Logit = log_reg.predict(add_constant(X))

In [None]:
# plot the predictions with the actual values
import matplotlib.pyplot as plt
plt.scatter(pred_BMA,y-0.05)
plt.scatter(pred_Logit,y)
plt.xlabel("Predicted Probability")
plt.ylabel("Coronary Heart Disease \n(0=Not Present, 1=Present)")
plt.legend(['pred_BMA','pred_Logit'])

In [None]:
# compute accuracy
print("BMA Accuracy: ", np.sum((pred_BMA > 0.5) == y)/len(y))
print("Logit Accuracy: ", np.sum((pred_Logit > 0.5) == y)/len(y))