## Project: Development of a reduced pediatric injury prediction model
Created by: Thomas Hartka, MD, MS  
Date created: 12/18/20  
  
This notebook reads in the results of the all-combinations modeling and performs Bayesian model averaging (BMA).

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import scipy.stats as st
import matplotlib.pyplot as plt
from itertools import combinations

## Set outcome

In [2]:
# outcome of interest
#  ISS -> ISS>=16
#  TIL -> any injury on target injury list
outcome = "ISS"

## Read in results

In [3]:
if outcome == "ISS":
    results = pd.read_csv("../Results/Model_avg_10x-ext_pred-ISS.csv")
elif outcome == "TIL":
    results = pd.read_csv("../Results/Model_avg_10x-ext_pred-TIL.csv")
else:
    raise ValueError(f"Outcome not valid: {outcome!r}")

## Set variables

In [4]:
predictors = ['sex','age_5_9', 'age_10_14','age_15_18',
              'prop_restraint','any_restraint','front_row', 
              'dvtotal','pdof_rear','pdof_nearside','pdof_farside', 
              'rolled','multicoll','ejection',
              'splimit','abdeply','entrapment']

## Calculate mean AUC from cross-validation

In [5]:
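# convert coefficients to binary indicators (1 = predictor included in the model)
# (vectorized comparison replaces DataFrame.applymap, which is deprecated in recent pandas)
results_bin = (results[predictors] != 0).astype(int)

# add AUC results into results_bin alongside the binary indicators
results_bin['AUC'] = results.AUC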

# determine mean AUC for each model
results_bin = results_bin.groupby(predictors).mean().reset_index()
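
As a small illustration of the step above, the sketch below applies the same binarize-then-group logic to a made-up table. The predictor names (`a`, `b`), the coefficient values, and the AUC values are all invented for the example and do not come from the study data.

```python
import pandas as pd

# Hypothetical cross-validation output: one row per fold, with fitted
# coefficients (0 means the predictor was left out) and that fold's AUC.
toy = pd.DataFrame({
    "a":   [0.8, 0.7, 0.0, 0.0],
    "b":   [0.0, 0.0, 1.2, 1.1],
    "AUC": [0.70, 0.72, 0.60, 0.64],
})
toy_predictors = ["a", "b"]

# 1 if the coefficient is non-zero, 0 otherwise
toy_bin = (toy[toy_predictors] != 0).astype(int)
toy_bin["AUC"] = toy["AUC"]

# one row per distinct predictor set, holding the mean AUC across its folds
toy_mean = toy_bin.groupby(toy_predictors).mean().reset_index()
print(toy_mean)
```

Rows that share the same set of included predictors collapse into a single row, so each distinct model ends up with one mean AUC.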

# Bayesian Model Averaging
The following equation was used to perform BMA for this analysis:
$$
Pr(\theta_i \neq 0 |X) = 
\frac{
    Pr(X|\theta_i \neq 0)Pr(\theta_i)
}
{
\sum^K_{k=1}
    Pr(X|\theta_k \neq 0)Pr(\theta_k)
}
\quad
\quad
\text{
 Eqt. 1
}$$

Eqt. 1 calculates the posterior probability that the coefficient $\theta_i$ for the variable $\beta_i$ is non-zero, given data $X$.  We used uniform prior probabilities, so $Pr(\theta_i)=Pr(\theta_k)$ and these terms cancel out.  The marginal probability in the denominator therefore reduces to the sum of the likelihoods over all variables.  
  
The likelihood of each variable was calculated by:

$$
Pr(X|\theta_i \neq 0) = 
\sum_{M_l: \beta_i \in M_l}
    Pr(X|M_l)
\quad\quad\text{Eqt. 2}$$

Eqt. 2 sums the likelihoods of all models $M_l$ that contain the variable $\beta_i$.
  
The likelihood of the data given each model was calculated as:
$$
Pr(X|M_l) = 
\frac{
    AUC(M_l|X)-0.5
}
{
    0.5
} 
= 
2 \cdot AUC(M_l|X)-1
\quad\quad\text{Eqt. 3}$$

This adjustment to AUC was performed since an AUC of 0.5 represents a model with no discriminatory ability.  Eqt. 3 rescales AUC values between 0.5 and 1.0 to values between 0.0 and 1.0.  
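
A minimal sketch of the rescaling in Eqt. 3 follows. The helper name `auc_to_weight` is hypothetical and is not part of the notebook. It also shows that an AUC below 0.5 maps to a negative value, a case the notebook does not handle explicitly.

```python
def auc_to_weight(auc: float) -> float:
    """Rescale an AUC to the model weight of Eqt. 3, i.e. 2 * AUC - 1."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie in [0, 1]")
    return 2.0 * auc - 1.0


# 0.5 (no discrimination) maps to 0 and 1.0 (perfect discrimination) maps to 1;
# an AUC below 0.5 gives a negative weight
for auc in (0.5, 0.75, 1.0, 0.4):
    print(auc, auc_to_weight(auc))
```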
  
By combining these equations, we can rewrite the posterior probability from Eqt. 1 as:

$$
Pr(\theta_i \neq 0 |X) = 
\frac{
    \sum_{M_l: \beta_i \in M_l}
        [2 \cdot AUC(M_l|X)-1]
}
{
\sum^K_{k=1}
    \sum_{M_l: \beta_k \in M_l}
        [2 \cdot AUC(M_l|X)-1]
}
\quad
\quad
\text{
 Eqt. 4
}$$
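
To make Eqt. 4 concrete, here is a small worked example in plain Python. The variable names (`x1`, `x2`) and the AUC values are made up for illustration, and the dictionary layout is only one convenient way to express the sums.

```python
# Hypothetical mean AUCs for three candidate models, keyed by the
# variables each model contains (names and values are invented).
model_auc = {
    ("x1",): 0.70,
    ("x2",): 0.60,
    ("x1", "x2"): 0.80,
}
variables = ["x1", "x2"]

# Eqt. 3: weight of each model
weight = {model: 2 * auc - 1 for model, auc in model_auc.items()}

# Eqt. 2: likelihood of a variable = sum of the weights of models containing it
likelihood = {
    v: sum(w for model, w in weight.items() if v in model)
    for v in variables
}

# Eqt. 4: normalize by the sum of the likelihoods over all variables
total = sum(likelihood.values())
posterior = {v: likelihood[v] / total for v in variables}
print(posterior)
```

Here the model weights are roughly 0.4, 0.2, and 0.6, so `x1` has likelihood 1.0 and `x2` has likelihood 0.8. The posterior probabilities are therefore about 1.0/1.8 ≈ 0.56 and 0.8/1.8 ≈ 0.44, which sum to 1.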

## Calculate Likelihoods
  
The likelihood of a variable is the sum of the discriminatory power of all models containing that variable.  

In [6]:
# rescale AUC to discriminatory ability (Eqt. 3): 2*AUC - 1
results_bin['discrim'] = (2*results_bin.AUC)-1

# insert discriminatory values into variable positions
discrim = results_bin[predictors].mul(results_bin.discrim, axis=0)

In [7]:
# determine number of variables in each model
results_bin['num_vars'] = results_bin[predictors].sum(axis=1)

# sum the likelihood values over all variables (denominator of Eqt. 4);
# a model with n variables contributes its discriminatory value n times
bf_sum = np.sum(results_bin.discrim * results_bin.num_vars)
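
The denominator of Eqt. 4 is a double sum over variables and over the models containing each variable. Weighting each model's discriminatory value by its number of variables gives the same total, because a model with n variables is counted once for each of them. The sketch below checks that identity on made-up numbers. The toy table and its column names are illustrative assumptions, not study data.

```python
import numpy as np
import pandas as pd

toy_predictors = ["a", "b", "c"]
toy = pd.DataFrame({
    "a": [1, 0, 1, 1],
    "b": [0, 1, 1, 1],
    "c": [0, 0, 0, 1],
    "discrim": [0.40, 0.20, 0.50, 0.60],
})

# place each model's discriminatory value in the columns of its variables
per_var = toy[toy_predictors].mul(toy["discrim"], axis=0)

# denominator computed two ways
double_sum = per_var.sum().sum()                 # sum over variables, then models
num_vars = toy[toy_predictors].sum(axis=1)       # variables per model
weighted_sum = (toy["discrim"] * num_vars).sum()

assert np.isclose(double_sum, weighted_sum)

# dividing the per-variable sums by the shared denominator gives values summing to 1
toy_post = per_var.sum() / weighted_sum
print(toy_post)
```

Because both routes give the same denominator, the resulting posterior probabilities sum to 1 across variables.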

## Find posterior probabilities

The posterior probability for each variable was determined by Eqt. 4.


In [8]:
# find posterior prob for each variable
post_prob = discrim.sum()/bf_sum

# get ordered list
post_prob = pd.DataFrame(post_prob).rename(columns={0:'post_prob'}).sort_values('post_prob', ascending=False)

post_prob

| | post_prob |
|---|---|
| dvtotal | 0.068915 |
| ejection | 0.059512 |
| any_restraint | 0.059482 |
| entrapment | 0.059461 |
| rolled | 0.059083 |
| multicoll | 0.058459 |
| prop_restraint | 0.058213 |
| pdof_farside | 0.058018 |
| pdof_nearside | 0.05797 |
| splimit | 0.057921 |


In [9]:
# make index column and rename
post_prob = post_prob.reset_index().rename(columns={'index':'variable'})

## Store results

In [10]:
post_prob.to_csv("../Results/Var_prob_10x-ext_pred-"+outcome+".csv",index=False)