# Introduction

In this notebook we explore metagenomics data. This dataset was created by the team of Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata; they published [a research article in July of 2016](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004977). The authors used 8 publicly available metagenomic datasets, and applied [MetaPhlAn2](https://github.com/segatalab/metaml#metaml---metagenomic-prediction-analysis-based-on-machine-learning) to generate species abundance features.

## Logistics behind the Input Data

This notebook was created to further explore the metagenomics data on Kaggle. The dataset is available at https://www.kaggle.com/antaresnyc/metagenomics and includes:
* abundance.txt: a table containing the abundances of each organism type
  * the first 210 features include meta-data about the samples
  * the rest of the features include the abundance data in float-type
* marker_presence.txt: a table containing the presence of strain-specific markers. 
  * the first 210 features include meta-data about the samples (same as abundance.txt)
  * In a previous notebook I converted the marker presence feature data into a sparse matrix for easier downloading. This sparse matrix is found on [kaggle](https://www.kaggle.com/sklasfeld/metagenomics-marker-presence-sparse-matrix).
* markers2clades_DB.txt: a lookup table to associate each marker identifier to the corresponding species.
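
As a rough illustration of why the sparse format helps with the marker presence table, here is a minimal sketch that stores a toy presence/absence matrix with `scipy.sparse` and reads it back. The toy values are made up, and using `save_npz`/`load_npz` is an assumption about how such an `.npz` file can be written and read, not a description of the earlier notebook.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Toy presence/absence table: 3 samples x 5 markers (hypothetical values).
dense = np.array([[0, 1, 0, 0, 1],
                  [0, 0, 0, 0, 0],
                  [1, 0, 0, 1, 0]], dtype=np.int8)

# A CSR matrix stores only the non-zero entries.
sparse_matrix = scipy.sparse.csr_matrix(dense)

# Round trip through an .npz file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "marker_presence_matrix.npz")
    scipy.sparse.save_npz(path, sparse_matrix)
    restored = scipy.sparse.load_npz(path)

print(restored.shape, restored.nnz)
```

Only the non-zero entries are written to disk, which is what makes a mostly empty table much smaller to store and download.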

In summary, every sample comes with 210 metadata features and the abundance of each organism. If an organism is present in a sample, we also have its strain-specific marker information.

## Libraries
Below I import some libraries that may be useful and then print the input files.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy
import scipy.sparse
import networkx as nx
from sklearn import preprocessing

# plot with matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#from plotnine import * # used to plot data

# progress bar
from tqdm import tqdm

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
marker_presence_matrix_file="/kaggle/input/metagenomics-marker-presence-sparse-matrix/marker_presence_matrix.npz"
markers2clades_DB_file="/kaggle/input/human-metagenomics/markers2clades_DB.csv"
abundance_file="/kaggle/input/human-metagenomics/abundance.csv"
marker_presence_table_file="/kaggle/input/human-metagenomics/marker_presence.csv"

# Cleaning the Data
The marker matrix depends on the abundance table, in that strain-specific markers can only appear if a specific strain is abundant. Both tables can be merged with a join on the 210 sample metadata columns. However, these columns are very messy, so let's clean them before we move on to understanding the rest of the data.

## Testing the meta data

The meta data information is given in both the marker_presence and abundance tables. I just wanted to make sure they contain the same information.

In [None]:
%%time

samples_df = pd.read_csv(abundance_file,
                         sep=",", dtype=object,usecols=range(0,210))

In [None]:
%%time
if 1 == 0:
    samples_df2 = pd.read_csv(marker_presence_table_file,
                         sep=",", dtype=object,usecols=range(0,210))

In [None]:
if 1 == 0:
    samples_df.compare(samples_df2, align_axis=0)

It looks like they are basically the same, so I can move forward using `samples_df`.
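
For reference, here is a minimal sketch of this kind of check on two toy metadata tables. The column `sampleID` and all values are hypothetical; `DataFrame.compare` needs pandas 1.1 or newer and expects both frames to share the same labels.

```python
import pandas as pd

# Two toy metadata tables (hypothetical columns and values).
meta_a = pd.DataFrame({"sampleID": ["s1", "s2"], "disease": ["t2d", "n"]})
meta_b = meta_a.copy()

# equals() gives a single yes/no answer.
same = meta_a.equals(meta_b)

# compare() lists only the cells that differ; it is empty for identical frames.
diff_identical = meta_a.compare(meta_b)

# Introduce one difference and compare again.
meta_b.loc[1, "disease"] = "t2d"
diff_changed = meta_a.compare(meta_b)

print(same, diff_identical.empty, len(diff_changed))
```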

In [None]:
samples_df.describe()

In [None]:
samples_df.query('dataset_name in ["t2dmeta_long","t2dmeta_short"]')['disease'].unique()

## Cleaning meta features

Remove all columns with only one value.

In [None]:
samples_df = samples_df.loc[:, samples_df.nunique() > 1].copy()

Next I look at categorical columns (that is, any feature with fewer than 20 distinct values).

In [None]:
if 1 == 0:
    for col in samples_df.loc[:, samples_df.nunique() < 20]:
        print("%s:%i" % (col,samples_df[col].nunique()))
        print(samples_df[col].unique())
        print("")

It looks like `nd`, `na`, `unknown` and `-` all stand for missing data. Therefore let's replace all of these values with `np.nan`.

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
missing_tokens = ["nd", "na", "-", " -", "unknown"]
samples_df = samples_df.replace(missing_tokens, np.nan)

We can remove all columns that have only one value besides NaN. These do not seem to be very informative anyway.

In [None]:
# change the if statement to visualize
if 1==0:
    for col in samples_df.loc[:, samples_df.nunique() == 1].columns:
        samples_df[col].fillna("NaN").value_counts().sort_values().plot(
            kind = 'bar', title=col)
        plt.show()
        
samples_df = samples_df.loc[:, samples_df.nunique() > 1].copy()
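
To make the effect of these two cleaning steps concrete, here is a small sketch on a toy table. The column names and values are invented for illustration; the placeholder tokens are the ones listed above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["china", "china", "china"],  # constant column
    "smoker": ["nd", "yes", "unknown"],      # one real value plus placeholders
    "gender": ["male", "female", "-"],       # two real values
})

# Turn the placeholder strings into proper missing values.
missing_tokens = ["nd", "na", "-", " -", "unknown"]
toy = toy.replace(missing_tokens, np.nan)

# nunique() ignores NaN, so columns with a single real value are dropped.
toy = toy.loc[:, toy.nunique() > 1].copy()

print(list(toy.columns))
```

Only `gender` survives here, because `smoker` is left with a single real value once the placeholders become NaN.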

I want to convert some columns into boolean-like codes. For example, if the values are either:
* "yes", "no", or null
* "y", "n", or null
* "positive", "negative", or null
* "a" (affected), "u" (unaffected), or null

I want to convert the affirmative value, the negative value, and null into `2`, `1`, and `0` respectively.

In [None]:
bool_vals = {'True': 2,
             'False': 1,
             'Null': 0}
# (true-like, false-like) label pairs that mark a column as boolean
bool_pairs = [("yes", "no"), ("y", "n"), ("positive", "negative"), ("a", "u")]
for col in samples_df.loc[:, samples_df.nunique() < 4]:
    values = set(samples_df[col].unique())
    for true_val, false_val in bool_pairs:
        if true_val in values and false_val in values:
            # nulls become 0, then the two labels become 2 and 1
            samples_df[col] = (samples_df[col]
                               .fillna(bool_vals['Null'])
                               .replace({true_val: bool_vals['True'],
                                         false_val: bool_vals['False']}))
            break
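
As a quick sanity check of the encoding, here is a minimal sketch on a single toy column. It uses `Series.map` instead of `replace`, which is just one possible way to get the same codes.

```python
import pandas as pd

# Affirmative -> 2, negative -> 1, missing -> 0 (the scheme described above).
codes = {"yes": 2, "no": 1}

col = pd.Series(["yes", None, "no", "yes"])

# map() turns unmapped or missing entries into NaN, which fillna() sets to 0.
encoded = col.map(codes).fillna(0).astype(int)

print(encoded.tolist())  # [2, 0, 1, 2]
```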

Similarly, for columns that contain 2 values (not including null) I will convert the values to numbers. For example, I will change the column named "gender" to "gender:Female|Male". The values will be 1 for Female, 2 for Male, and 0 for null.

In [None]:
for col in samples_df.loc[:, samples_df.nunique() == 2].columns:
    # nunique() ignores NaN, so exactly two non-null values remain
    first_val, second_val = samples_df[col].dropna().unique()
    # skip columns that the previous cell already encoded as numbers
    if {first_val, second_val} <= set(bool_vals.values()):
        continue
    new_col_name = "%s:%s|%s" % (col, first_val, second_val)
    # change the column name
    samples_df = samples_df.rename(columns={col: new_col_name})
    # change values in the column
    samples_df[new_col_name] = (samples_df[new_col_name]
                                .fillna(bool_vals['Null'])
                                .replace({first_val: bool_vals['False'],
                                          second_val: bool_vals['True']}))
categorical_cols=samples_df.loc[:, samples_df.nunique() < 20].columns
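
The renaming scheme can be tried out on a toy column with the sketch below. The helper `encode_two_valued` is hypothetical and only mirrors the idea of the cell above: the first value seen becomes 1, the second becomes 2, and null becomes 0.

```python
import pandas as pd

def encode_two_valued(df, col):
    """Rename `col` to 'col:first|second' and encode it as 1, 2 and 0 for null."""
    # Assumes the column holds exactly two distinct non-null values.
    first_val, second_val = df[col].dropna().unique()
    new_name = f"{col}:{first_val}|{second_val}"
    codes = df[col].map({first_val: 1, second_val: 2}).fillna(0).astype(int)
    out = df.drop(columns=col)
    out[new_name] = codes
    return out

toy = pd.DataFrame({"gender": ["Female", "Male", None, "Female"]})
toy = encode_two_valued(toy, "gender")

print(toy)
```

Keeping both labels in the column name means the numeric codes can still be decoded later without a separate lookup table.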

It was brought to my attention that most samples come from stool. Therefore it makes sense that we remove other types of samples.

In [None]:
samples_df['bodysite'].value_counts().plot(kind='bar')

In [None]:
# count the low-cardinality features among the stool samples only
stool_mask = samples_df['bodysite'] == 'stool'
print(np.sum(samples_df.loc[stool_mask].nunique() < 3))

Unfortunately this didn't help remove any features from the metadata.

In [None]:
stool_samp_df = samples_df.loc[samples_df['bodysite'] == 'stool',:].copy()


## Cleaning abundance file

Import the abundance file without the first 211 columns (since we already dealt with those above).

In [None]:
%%time

abundance_df = (pd.read_csv(abundance_file,sep=",", dtype=object)
               .iloc[:,211:])
abundance_df.head()

I am wondering if we can remove any columns that are redundant. In other words, I would like to remove columns that have identical values.

In the following loop I compare each column only with the columns that directly follow it and whose names contain its name (its sub-categories). When a sub-category column is identical, I record it in `redundant_dict`, using the broadest category as the key and the list of identical sub-categories as the value.

In [None]:
#seen_list=[]
redundant_dict={}
remove_cols=[]
i=0
while i < abundance_df.shape[1]:
    j=i+1
    next_step=abundance_df.shape[1]
    while j < abundance_df.shape[1]:
        #print("%i,%i" % (i,j))
        col_i = abundance_df.columns[i]
        col_j = abundance_df.columns[j]
        if col_i in col_j:
            #print(abundance_df.iloc[:,i].equals(abundance_df.iloc[:,j]))
            if abundance_df.iloc[:,i].equals(abundance_df.iloc[:,j]):
                # add redundant column name to data-structure
                remove_cols.append(col_j)
                if col_i in redundant_dict: redundant_dict[col_i].append(col_j)
                else: redundant_dict[col_i]= [col_j]
                # next look at i vs j+1
                
            else:
                #print("next_step: "+ str(next_step))
                if next_step > j:
                    next_step=j
            
            j += 1
            if j == abundance_df.shape[1]:
                i = j
        else:
            if next_step < j:
                i=next_step
                next_step=abundance_df.shape[1]
            else:
                i=j
            j=abundance_df.shape[1]
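
The idea behind this loop can be illustrated on a toy table with a simplified variant. This sketch is not the loop above: it compares each column with its direct parent (the name up to the last `|`) and assumes parents appear before their children. The taxon names and values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "k__Bacteria": [10.0, 5.0],
    "k__Bacteria|p__Firmicutes": [10.0, 5.0],            # identical to parent
    "k__Bacteria|p__Firmicutes|c__Bacilli": [4.0, 5.0],  # differs from parent
})

redundant = {}  # kept ancestor -> list of identical descendants
for col in toy.columns:
    parent = col.rpartition("|")[0]
    if parent in toy.columns and toy[col].equals(toy[parent]):
        # Credit the column to the ancestor that is actually kept.
        root = next((k for k, v in redundant.items() if parent in v), parent)
        redundant.setdefault(root, []).append(col)

remove_cols_toy = [c for cols in redundant.values() for c in cols]
cleaned = toy.drop(columns=remove_cols_toy)

print(redundant)
print(list(cleaned.columns))
```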

Drop the redundant columns.

In [None]:
abundance_df = (
        abundance_df.drop(
            remove_cols,
            axis=1))

In [None]:
print(len(remove_cols))

1,441 columns were dropped from further analysis since they were redundant with parent columns. 

In [None]:
# Note: to get the full abundance table back after cleaning, use the following code:
if 1==0:
    c = samples_df.merge(abundance_df, how='left',
                               left_index=True, right_index=True)

In [None]:
samples_df['dataset_name']

## Cleaning Marker Presence file
Import this file without the first 211 columns (since we already dealt with those previously). The file is very large, so it could instead be imported as a sparse matrix.

In [None]:
%%time 

markers_reader = pd.read_csv(
        marker_presence_table_file,
        sep=",", 
        dtype=object,
        usecols=range(211,288558),
        nrows=10)

In [None]:
markers_reader
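
One possible way to handle a table of this size is to read it in chunks and keep only a sparse copy of each chunk. The sketch below does this on a tiny in-memory CSV. The marker names `m1` to `m3`, the chunk size and the `int8` dtype are assumptions for illustration, not properties of the real file.

```python
import io

import pandas as pd
import scipy.sparse

# Tiny stand-in for the marker presence table (hypothetical content).
csv_text = "m1,m2,m3\n0,1,0\n0,0,0\n1,0,1\n0,0,1\n"

chunks = []
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype="int8"):
    # Keep only the sparse representation of each chunk.
    chunks.append(scipy.sparse.csr_matrix(chunk.to_numpy()))

# Stack the per-chunk matrices row-wise into one sparse matrix.
marker_matrix = scipy.sparse.vstack(chunks).tocsr()

print(marker_matrix.shape, marker_matrix.nnz)
```

With this pattern only one dense chunk is held in memory at a time.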

## Construct a graph for genomic part
In order to capture the tree structure of the bacterial lineages, we construct a directed graph in which each node represents a bacterial clade and each edge represents a parent-child relationship. To quantify the presence of each clade, I attach a vector to each node as a property.
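
Before moving to tensors, the node and edge extraction can be sketched in plain Python for a single sample. The lineage names and abundances are made up; the only assumption taken from the data is that column names are `|`-separated lineages.

```python
# Toy abundances for one sample, keyed by '|'-separated lineage.
sample = {
    "k__Bacteria": 100.0,
    "k__Bacteria|p__Firmicutes": 60.0,
    "k__Bacteria|p__Bacteroidetes": 0.0,  # absent, so it gets no node
}

nodes = {}  # clade name -> abundance
edges = []  # (parent clade, child clade)
for lineage, value in sample.items():
    if value > 0:
        taxa = lineage.split("|")
        # The last element names the clade this column describes.
        nodes[taxa[-1]] = value
        if len(taxa) >= 2:
            # The second-to-last element is its direct parent.
            edges.append((taxa[-2], taxa[-1]))

print(nodes)
print(edges)
```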

In [None]:
def graph_label(samples_df,abundance_df,dataset=None):
    if dataset:
        dataset = dataset if isinstance(dataset,list) else [dataset]
        ids = samples_df['dataset_name'].isin(dataset)
        # drop the old index so it is not added as an extra column
        samples_df = samples_df[ids].reset_index(drop=True)
        abundance_df = abundance_df[ids].reset_index(drop=True)
    le = preprocessing.LabelEncoder()
    # target values
    y = le.fit_transform(samples_df['disease'])
    # get embedding of all nodes
    le_nodes = preprocessing.LabelEncoder()
    # encode labels between 0 and n_classes-1 for each bacterial label
    le_nodes.fit([gene for col in abundance_df.columns for gene in col.split('|')])
    max_id = np.max(le_nodes.transform([gene for col in abundance_df.columns for gene in col.split('|')]))
    data_list = []
    for i in range(len(abundance_df)):
        node_list = [] # list of [$cur_bacteria_name,$abundance_val]
        edge_list = [] # list of [$parent_bacteria_name,$cur_bacteria_name]
        for key, val in abundance_df.iloc[i].to_dict().items():
            if float(val) > 0:
                bacteria_list = key.split('|')
                node = [le_nodes.transform([bacteria_list[-1]])[0],float(val)]
                node_list.append(node)
                if len(bacteria_list) >= 2:
                    edge_list.append(le_nodes.transform(bacteria_list[-2:]))
        # convert `y`, `node_list`, and `edge_list` into Tensor formats
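        edge_array = np.array(edge_list)
        # shape [2, E]: row 0 holds parents, row 1 holds children
        edge_index = torch.tensor(edge_array.T, dtype=torch.long)
        # cast to long for the embedding layer; this truncates abundances to integers
        node_features = torch.tensor(np.array(node_list), dtype=torch.long)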
        label = torch.FloatTensor([y[i]])
        # set these Tensors into a pytorch Data() object
        # which is used to model graphs
        data = Data(node_features,edge_index=edge_index,y=label)
        data_list.append(data)
    return data_list,max_id

In [None]:
import torch; print(torch.__version__)

In [None]:
print(torch.version.cuda)

In [None]:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-geometric

In [None]:
from torch_geometric.utils.convert import from_networkx
from torch_geometric.data import InMemoryDataset
from torch_geometric.data import Data

In [None]:
"""
class AbundanceDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None,dataset = None):
        super(AbundanceDataset, self).__init__(root, transform, pre_transform)
        self.dataset = dataset
        self.data, self.slices = torch.load(self.processed_paths[0])
    @property
    def raw_file_names(self):
        return []
    @property
    def processed_file_names(self):
        return ['../input/yoochoose_click_binary_1M_sess.dataset']

    def download(self):
        pass
    def process(self):
        data_list = []
        graph_label_pair = graph_label()
        for G,value in zip(graph_label_pair['graph'],graph_label_pair['value']):
            data = from_networkx(G)
            data.y = torch.float
            data_list.append(data)
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
        
"""

In [None]:
from torch_geometric.data import DataLoader
from torch_geometric.nn import GINConv, global_add_pool

In [None]:
'''
data_list = []
graph_label_pair = graph_label(samples_df,abundance_df,dataset = ["t2dmeta_long","t2dmeta_short"])
for G,value in zip(graph_label_pair['graph'],graph_label_pair['value']):
    print(G)
    data = from_networkx(G)
    data.y = torch.float(value)
    data_list.append(data)
'''
t2dml_samples_df = samples_df.loc[samples_df['dataset_name']=="t2dmeta_long",:].copy()
t2dml_abundance_values_df = abundance_df.iloc[(list(t2dml_samples_df.index))]
# merge meta-data features with abundance features
t2dml_abundance_df = t2dml_samples_df.merge(abundance_df, how='inner',left_index=True, right_index=True)
data_list,max_id = graph_label(t2dml_samples_df,t2dml_abundance_values_df)
print(max_id)
train_datalist = data_list[len(data_list) // 10:]
test_datalist = data_list[:len(data_list) // 10]

train_loader = DataLoader(train_datalist, batch_size=16, shuffle=True)
test_loader = DataLoader(test_datalist, batch_size=4)

In [None]:
embed_dim = 32
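# a plain Python variable has no effect; the setting must be an environment
# variable and must be set before CUDA is initialized
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'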
from torch.nn import Sequential as Seq, Linear, ReLU
from torch_geometric.utils import remove_self_loops, add_self_loops
from torch_geometric.nn import TopKPooling,MessagePassing
from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp
import torch.nn.functional as F

class SAGEConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(SAGEConv, self).__init__(aggr='max') #  "Max" aggregation.
        self.lin = torch.nn.Linear(in_channels, out_channels)
        self.act = torch.nn.ReLU()
        self.update_lin = torch.nn.Linear(in_channels + out_channels, in_channels, bias=False)
        self.update_act = torch.nn.ReLU()
        
    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]
        
        
        edge_index, _ = remove_self_loops(edge_index)
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        
        
        return self.propagate(edge_index,x=x)

    def message(self, x_j):
        # x_j has shape [E, in_channels]

        x_j = self.lin(x_j)
        x_j = self.act(x_j)
        
        return x_j

    def update(self, aggr_out, x):
        # aggr_out has shape [N, out_channels]


        new_embedding = torch.cat([aggr_out, x], dim=1)
        
        new_embedding = self.update_lin(new_embedding)
        new_embedding = self.update_act(new_embedding)
        
        return new_embedding


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.conv1 = SAGEConv(embed_dim, embed_dim)
        self.pool1 = TopKPooling(embed_dim, ratio=0.8)
        self.conv2 = SAGEConv(embed_dim, embed_dim)
        self.pool2 = TopKPooling(embed_dim, ratio=0.8)
        self.conv3 = SAGEConv(embed_dim, embed_dim)
        self.pool3 = TopKPooling(embed_dim, ratio=0.8)
        self.item_embedding = torch.nn.Embedding(num_embeddings=max_id +1, embedding_dim=embed_dim)
        self.lin1 = torch.nn.Linear(embed_dim * 2, embed_dim)
        self.lin2 = torch.nn.Linear(embed_dim, int(embed_dim / 2))
        self.lin3 = torch.nn.Linear(int(embed_dim / 2), 1)
        self.bn1 = torch.nn.BatchNorm1d(embed_dim)
        self.bn2 = torch.nn.BatchNorm1d(int(embed_dim / 2))
        self.act1 = torch.nn.ReLU()
        self.act2 = torch.nn.ReLU()        
  
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.item_embedding(x)
        x = x.squeeze(1)        

        x = F.relu(self.conv1(x, edge_index))

        x, edge_index, _, batch, _ = self.pool1(x, edge_index, None, batch)
        x1 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv2(x, edge_index))
     
        x, edge_index, _, batch, _ = self.pool2(x, edge_index, None, batch)
        x2 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv3(x, edge_index))

        x, edge_index, _, batch, _ = self.pool3(x, edge_index, None, batch)
        x3 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = x1 + x2 + x3

        x = self.lin1(x)
        x = self.act1(x)
        x = self.lin2(x)
        x = self.act2(x)      
        x = F.dropout(x, p=0.5, training=self.training)

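        # single output unit: a sigmoid gives a probability
        # (assumes `disease` is binary for the selected dataset)
        x = torch.sigmoid(self.lin3(x)).squeeze(1)

        return x

In [None]:
def train():
    model.train()

    total_loss = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        # binary cross-entropy pairs with the sigmoid output and float labels
        loss = F.binary_cross_entropy(output, data.y)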
        loss.backward()
        optimizer.step()
        total_loss += float(loss) * data.num_graphs
    return total_loss / len(train_loader.dataset)
# fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)


In [None]:
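from sklearn.metrics import roc_auc_score

def evaluate(loader):
    model.eval()

    predictions = []
    labels = []

    with torch.no_grad():
        for data in loader:

            data = data.to(device)
            pred = model(data).detach().cpu().numpy()

            label = data.y.detach().cpu().numpy()
            predictions.append(pred)
            labels.append(label)

    # flatten the per-batch arrays and return the AUC reported in the training loop
    predictions = np.hstack(predictions)
    labels = np.hstack(labels)
    return roc_auc_score(labels, predictions)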

In [None]:
for epoch in range(30):
    loss = train()
    train_auc = evaluate(train_loader)
    # there is no validation split, so only train and test scores are reported
    test_auc = evaluate(test_loader)
    print('Epoch: {:03d}, Loss: {:.5f}, Train Auc: {:.5f}, Test Auc: {:.5f}'.
          format(epoch, loss, train_auc, test_auc))