# Microbial community disease risk prediction

In this case study, we'll develop a neural network to predict disease risk from microbial community sequence data.

We have 16S rDNA sequence data from 16,344 samples, roughly half of which are from individuals who have been diagnosed with type 1 diabetes (aka, "cases"), and half of which are from individuals who do not have type 1 diabetes ("controls").

The data are available in a GitHub repository as a comma-separated values (.csv) file, so we can use the pandas library to download the sequence data and associated disease-state labels into a pandas dataframe:

In [None]:
import pandas
dataframe = pandas.read_csv('https://raw.githubusercontent.com/bryankolaczkowski/ALS3200C/main/mbiome.data.csv')
dataframe.head()

There are 256 "DTA" columns, labelled DTA0, ..., DTA255. Each of these DTA columns represents a particular bacterial "species" found in the samples. The 'relative abundance' of each species in each sample (row) is reported. Relative abundance values have been center log-ratio transformed, which is a common method used to 'normalize' microbial relative abundance data.

In a typical analysis of 16S rDNA sequence data, the 'relative abundance' of each sequence in the sample is given as the *number* of sequence reads matching that sequence in the sample. One *could* divide each sequence count by the total number of counts in that sample, which would produce a typical 'relative abundance' value between 0.0 (not found in the sample) and 1.0 (the *only* sequence found in the sample).
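As a minimal sketch of this 'frequency transform', assume a small made-up table of read counts. The `counts` array below is hypothetical and is not part of the case-study data:

```python
import numpy as np

# hypothetical read counts: 2 samples (rows) x 4 sequences (columns)
counts = np.array([[10, 30, 60, 0],
                   [ 5,  5,  5, 5]], dtype=float)

# divide each count by the total number of reads in its sample (row)
frequencies = counts / counts.sum(axis=1, keepdims=True)
print(frequencies)
```

Each row of `frequencies` sums to 1.0, and a sequence that was never observed in a sample keeps a value of 0.0.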

However, it's more common to perform some sort of log-ratio transform of the sequence count data. Log-ratio transforms have a couple of advantages over the 'frequency transform' above. First, putting numbers on a log scale often makes them more 'normally distributed', which typically provides a better fit to the assumptions of most statistical models. Second, the log scale can be 'centered' at zero, with positive and negative values indicating deviations from the 'average' value of zero; this 'centering' often leads to better results from machine-learning and neural-network models.

The center log-ratio transform is simple to calculate and is commonly used for microbial community sequence projects. One simply divides each sequence's count by the *geometric* mean of the counts of all sequences in the sample, and then takes the logarithm of this 'ratio'.
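The calculation can be sketched in a few lines of numpy. The function name `center_log_ratio` and the `pseudocount` added to avoid taking the logarithm of zero are assumptions made for illustration; we don't know how zero counts were handled when the case-study data were prepared.

```python
import numpy as np

def center_log_ratio(counts, pseudocount=1.0):
    """CLR-transform each row (sample) of a count table.

    The pseudocount is an illustrative assumption to avoid log(0).
    """
    log_counts = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # log(x / geometric_mean(x)) is the same as log(x) - mean(log(x))
    return log_counts - log_counts.mean(axis=1, keepdims=True)

# hypothetical read counts: 2 samples (rows) x 4 sequences (columns)
counts = np.array([[10, 30, 60, 0],
                   [ 5,  5,  5, 5]])
clr = center_log_ratio(counts)
print(clr)
```

Because the mean of the log values is subtracted from each row, every transformed sample is centered at zero: sequences that are more abundant than the sample's geometric mean get positive values, and less abundant sequences get negative values.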

These data have already been center log-ratio transformed, and you can see that the values typically range between about +2.5 and -2.5.

The final column in the data file, labelled "LBL0", is the 'disease state' indicator (the label we'd like to predict), with 0 indicating a 'control' individual with no type 1 diabetes diagnosis, and 1 indicating a 'case' individual who has been diagnosed with type 1 diabetes.

Our goal is to predict the LBL0 classification, given the microbial sequence information in columns DTA0, ..., DTA255.
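Before modeling, it can be worth confirming how balanced the two classes are. The sketch below uses a tiny hypothetical dataframe named `toy` as a stand-in; the same `value_counts` call could be applied to the LBL0 column of the real dataframe.

```python
import pandas as pd

# hypothetical stand-in for the real dataframe: 2 DTA columns plus the label
toy = pd.DataFrame({'DTA0': [0.3, -1.2, 0.8, -0.1],
                    'DTA1': [-0.3, 1.2, -0.8, 0.1],
                    'LBL0': [0, 1, 1, 0]})

# fraction of controls (0) and cases (1) among the samples
balance = toy['LBL0'].value_counts(normalize=True).sort_index()
print(balance)
```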

First, let's split our data into training and validation subsets, and extract the explanatory variables and labels.

Much of the following code cell should look familiar. Given the pandas dataframe, we first sample 80% of the data for training, and leave the remaining 20% for validation.

Next, we extract the columns starting with "DTA" as the explanatory variables. In this case, we need to 'expand' the data dimension, so we can model these data using a tensorflow sequence model (like a Conv1D or LSTM model).

Finally, we extract the LBL0 entries as our binary class labels.

In [None]:
import numpy as np

# create train-validate split
train_dataframe = dataframe.sample(frac=0.8, random_state=2100963)
valid_dataframe = dataframe.drop(train_dataframe.index)
print(train_dataframe.shape, valid_dataframe.shape, dataframe.shape)

# extract explanatory variables
dta_ids = [x for x in dataframe.columns if x.startswith('DTA')]
train_x = np.expand_dims(train_dataframe[dta_ids].to_numpy(), axis=-1)
valid_x = np.expand_dims(valid_dataframe[dta_ids].to_numpy(), axis=-1)
print(train_x.shape, valid_x.shape)

# extract labels
train_y = train_dataframe['LBL0'].to_numpy()
valid_y = valid_dataframe['LBL0'].to_numpy()
print(train_y.shape, valid_y.shape)

We can see that there are 16,344 total samples in our dataframe. We've extracted 13,075 for training and 3,269 for validation.
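As a quick sanity check on the split, the sketch below repeats the same `sample` and `drop` steps on a hypothetical stand-in dataframe (`toy`) with the same number of rows as the real data. It then confirms that the two subsets do not overlap and together cover every row.

```python
import pandas as pd

# hypothetical stand-in with the same number of rows as the real data
toy = pd.DataFrame({'LBL0': [0, 1] * 8172})   # 16,344 rows

# same split logic as above: sample 80% for training, keep the rest
train = toy.sample(frac=0.8, random_state=2100963)
valid = toy.drop(train.index)
print(len(train), len(valid), len(toy))

# the two subsets should be disjoint and together cover every row
assert train.index.intersection(valid.index).empty
assert len(train) + len(valid) == len(toy)
```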

After expanding the data dimension, we have explanatory variables of shape (256,1), a one-dimensional sequence of 256 bacterial 'species'.
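To see what expanding the data dimension does in isolation, here is a minimal sketch on a hypothetical array of zeros (`flat`) holding 3 samples of 256 values each:

```python
import numpy as np

# hypothetical batch: 3 samples, 256 values each
flat = np.zeros((3, 256))

# add a trailing axis of length 1, so each sample becomes a (256, 1) sequence
expanded = np.expand_dims(flat, axis=-1)
print(flat.shape, expanded.shape)
```

The values themselves are unchanged; only a trailing axis of length 1 is added, so that each sample has the (256,1) shape described above.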
