# Introduction

In this notebook we explore metagenomics data. This dataset was created by the team of Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata; they published [a research article in July of 2016](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004977). The authors used 8 publicly available metagenomic datasets, and applied [MetaPhlAn2](https://github.com/segatalab/metaml#metaml---metagenomic-prediction-analysis-based-on-machine-learning) to generate species abundance features. **Here we are only going to focus on one of those datasets for now to predict type 2 diabetes.**

## Logistics behind the Input Data

This notebook was created to further explore the metagenomics data on Kaggle. The dataset is available at https://www.kaggle.com/antaresnyc/metagenomics. It includes:
* abundance.txt: a table containing the abundances of each organism type
  * the first 210 features include meta-data about the samples
  * the rest of the features include the abundance data in float-type
* marker_presence.txt: a table containing the presence of strain-specific markers. 
  * the first 210 features include meta-data about the samples (same as abundance.txt)
  * In a previous notebook I converted the marker presence feature data into a sparse matrix for easier downloading. This sparse matrix is found on [kaggle](https://www.kaggle.com/sklasfeld/metagenomics-marker-presence-sparse-matrix).
* markers2clades_DB.txt: a lookup table to associate each marker identifier to the corresponding species.

In summary, each sample comes with 210 metadata features and the abundance of each organism. If an organism is present in a sample, we also have strain-specific marker information.
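
As a rough illustration of the sparse-matrix conversion mentioned in the file list above, the sketch below turns a small made-up 0/1 presence table into a SciPy CSR matrix and round-trips it through the `.npz` format. The toy values and the in-memory buffer are assumptions for illustration; the matrix published on Kaggle may have been built differently.

```python
import io

import numpy as np
import scipy.sparse

# Toy marker-presence table: rows are samples, columns are markers (made-up values)
dense_presence = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=np.int8)

# CSR stores only the non-zero entries, which is compact when most markers are absent
sparse_presence = scipy.sparse.csr_matrix(dense_presence)

# Round-trip through the .npz format (an in-memory buffer stands in for a file on disk)
buffer = io.BytesIO()
scipy.sparse.save_npz(buffer, sparse_presence)
buffer.seek(0)
restored_presence = scipy.sparse.load_npz(buffer)

print(restored_presence.shape, restored_presence.nnz)
```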

See Table 1 in the paper for a summary of the datasets considered in the experiment.

## Libraries
Below I import some libraries that may be useful and then print the input files.

In [None]:
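import subprocess
import sys

!pip install --upgrade pip

def import_or_install(package):
    """Try to import `package`; install it with pip if it is missing."""
    try:
        __import__(package)
    except ImportError:
        # `pip.main` was removed in pip 10; call pip as a subprocess instead
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])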

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy
import scipy.sparse

import networkx as nx # creates graph data-structures
#import_or_install('node2vec')
#from node2vec import Node2Vec # embeds graphs as vectors

# plot with matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#from plotnine import * # used to plot data

# progress bar
from tqdm import tqdm

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time

# install pytorch libraries
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.7.0+10.2.html
!pip install torch-geometric

In [None]:
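from sklearn import preprocessing

import torch
from torch_geometric.utils.convert import from_networkx
from torch_geometric.data import InMemoryDataset
from torch_geometric.data import Data
# DataLoader is used further down to batch the graphs
try:
    from torch_geometric.loader import DataLoader  # PyG >= 2.0
except ImportError:
    from torch_geometric.data import DataLoader  # older PyG releases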


In [None]:
marker_presence_matrix_file="/kaggle/input/metagenomics-marker-presence-sparse-matrix/marker_presence_matrix.npz"
markers2clades_DB_file="/kaggle/input/human-metagenomics/markers2clades_DB.csv"
abundance_file="/kaggle/input/human-metagenomics/abundance.csv"
marker_presence_table_file="/kaggle/input/human-metagenomics/marker_presence.csv"

# Cleaning the Data
To organize the data we think of it as three different data types:
1. The meta information (found in the first 210 columns of both the abundance and marker_presence tables)
2. The abundance data
3. The marker_presence data


## Cleaning meta features
Recall that each of the given tables combines multiple datasets, which do not all include the same meta information.
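
One way to see which metadata fields each dataset actually populates is to group a not-null mask by dataset name. The sketch below uses a small made-up table; only the `dataset_name` column mirrors the real data, and the other column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy metadata table; only `dataset_name` mirrors a real column, the rest is made up
toy_meta_df = pd.DataFrame({
    "dataset_name": ["study_a", "study_a", "study_b", "study_b"],
    "age": ["34", np.nan, np.nan, np.nan],
    "bmi": [np.nan, np.nan, "22.1", "27.4"],
})

# True where a dataset has at least one non-null value in a column
populated = (
    toy_meta_df.drop(columns="dataset_name")
    .notna()
    .groupby(toy_meta_df["dataset_name"])
    .any()
)

print(populated)
```

Columns that are `False` for a dataset carry no information for its samples and can be dropped once that dataset is selected.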

In [None]:
%%time

# import only the columns with the sample meta-data information
samples_df = pd.read_csv(abundance_file,
                         sep=",", dtype=object,usecols=range(0,210))

Each dataset focuses on specific disease and non-disease (control) groups.

In [None]:
# print a table of the different dataset
# and disease combinations
pd.DataFrame(samples_df.loc[:,['dataset_name','pubmedid','disease']].value_counts()).sort_values('dataset_name')

To start, let's focus on the `t2dmeta_long` dataset.

In [None]:
t2dml_samples_df = samples_df.loc[samples_df['dataset_name']=="t2dmeta_long",:].copy()

It looks like `nd`, `na`, `unknown` and `-` all stand for no data. Therefore, let's replace all of these values with NaN.

In [None]:
# `np.nan` is used because the `np.NaN` alias was removed in NumPy 2.0
missing_markers = ["nd", "na", "-", " -", "unknown"]
t2dml_samples_df = t2dml_samples_df.replace(missing_markers, np.nan)

We can remove all columns that have only 1 value (not including NaN). These do not seem to be too informative anyway.

In [None]:
# change the if statement to visualize
if 1==0:
    for col in t2dml_samples_df.loc[:, t2dml_samples_df.nunique() == 1].columns:
        t2dml_samples_df[col].fillna("NaN").value_counts().sort_values().plot(
            kind = 'bar', title=col)
        plt.show()
        
t2dml_samples_df = t2dml_samples_df.loc[:, t2dml_samples_df.nunique() > 1].copy()
t2dml_samples_df.columns

I want to convert some columns into boolean-like codes. For example, if the values are either:
* "yes", "no", or null
* "y", "n", or null
* "positive", "negative", or null
* "a" (affected), "u" (unaffected), or null

I want to convert them into `1`, `-1`, and `0` respectively.
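
To make the encoding concrete, here is a minimal sketch on a made-up column. It assumes the convention of the `bool_vals` dictionary in the next cell (positive answer to 1, negative answer to -1, missing to 0); the column name `smoker` is hypothetical, and `Series.map` is used here instead of the notebook's `replace` calls.

```python
import numpy as np
import pandas as pd

# Hypothetical yes/no column with one missing value
toy_df = pd.DataFrame({"smoker": ["yes", "no", np.nan, "yes"]})

# Same convention as `bool_vals`: positive -> 1, negative -> -1, missing -> 0
encoding = {"yes": 1, "no": -1}
toy_df["smoker"] = toy_df["smoker"].map(encoding).fillna(0).astype(int)

print(toy_df["smoker"].tolist())  # [1, -1, 0, 1]
```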

In [None]:
bool_vals={'True':1,
          'False':-1,
          'Null':0}
for col in t2dml_samples_df.loc[:, t2dml_samples_df.nunique() < 4]:
    if ("yes" in t2dml_samples_df[col].unique() and "no" in t2dml_samples_df[col].unique()):
            t2dml_samples_df[col] = t2dml_samples_df[col].fillna(bool_vals['Null'])
            t2dml_samples_df =t2dml_samples_df.replace({col: {'yes': bool_vals['True'], 'no': bool_vals['False']}})
    elif ("y" in t2dml_samples_df[col].unique() and "n" in t2dml_samples_df[col].unique()):
            t2dml_samples_df[col] = t2dml_samples_df[col].fillna(bool_vals['Null'])
            t2dml_samples_df =t2dml_samples_df.replace({col: {'y': bool_vals['True'], 'n': bool_vals['False']}})
    elif ("positive" in t2dml_samples_df[col].unique() and "negative" in t2dml_samples_df[col].unique()):
            t2dml_samples_df[col] = t2dml_samples_df[col].fillna(bool_vals['Null'])
            t2dml_samples_df =t2dml_samples_df.replace({col: {'positive': bool_vals['True'], 'negative': bool_vals['False']}})
    elif ("a" in t2dml_samples_df[col].unique() and "u" in t2dml_samples_df[col].unique()):
            t2dml_samples_df[col] = t2dml_samples_df[col].fillna(bool_vals['Null'])
            t2dml_samples_df =t2dml_samples_df.replace({col: {'a': bool_vals['True'], 'u': bool_vals['False']}})

Similarly, for columns that contain 2 values (not including null) I will convert the values to numbers. For example, I will change the column named "gender" to "gender:Female|Male". The first value in the new name (here Female) is encoded as -1, the second (Male) as 1, and null as 0.

In [None]:
for col in t2dml_samples_df.loc[:, t2dml_samples_df.nunique() == 2].columns:
    # the two distinct non-null values, in order of first appearance
    first_val, second_val = t2dml_samples_df[col].dropna().unique()
    # skip columns that the previous cell already encoded as 1/-1
    if {first_val, second_val} <= set(bool_vals.values()):
        continue
    new_col_name = "%s:%s|%s" % (col, first_val, second_val)
    # change the column name
    t2dml_samples_df = t2dml_samples_df.rename(columns={col: new_col_name})
    # change values in the column: null -> 0, first value -> -1, second value -> 1
    t2dml_samples_df[new_col_name] = t2dml_samples_df[new_col_name].fillna(bool_vals['Null'])
    t2dml_samples_df = t2dml_samples_df.replace(
        {new_col_name: {first_val: bool_vals['False'], second_val: bool_vals['True']}})
categorical_cols = t2dml_samples_df.loc[:, t2dml_samples_df.nunique() < 20].columns

## Building models based on abundance file
The abundance data values are reported for different taxonomic ranks (kingdom, phylum, division, class, order, family, genus, and species) of the bacteria phylogenetic tree. The values, which are formatted into a table structure, would make more sense to a machine in a tree format. Therefore, we convert these values into a graph object using the `networkx` library.

First we import the abundance data into a pandas dataframe. To start, we are only looking at data from the `t2dmeta_long` dataset. We create one table with the meta-data features and one with only the abundance values.

In [None]:
%%time

# read in abundance values from abundance file
abundance_df = pd.read_csv(abundance_file,sep=",", dtype=object).iloc[:,211:]
# collect columns with abundance values
abundance_cols = list(abundance_df.columns)
# filter for only samples from the `t2dmeta_long` dataset
t2dml_abundance_values_df = abundance_df.iloc[(
    list(t2dml_samples_df.index))]
# merge meta-data features with abundance features
t2dml_abundance_df = t2dml_samples_df.merge(
    abundance_df, how='inner',left_index=True, right_index=True)
t2dml_abundance_df.shape

### Model based on NetworkX
We create a dictionary called `graph_dict` to hold the directed graph for each sample. The key is the index of the sample in the dataframe, and the value is the `networkx` graph `G`.

We build `G` by looking at each value in the sample's row. If the value is greater than 0, we add a node to the graph. If the column name includes ranks below the kingdom, we also add an edge from the node's parent in the hierarchy to the node itself.
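
The construction described above can be sketched without `networkx` on a single made-up sample. The lineage strings and abundance values below are invented for illustration, and the `rank__Name` naming is an assumption about how the real column names look.

```python
# Made-up abundance row: lineage string -> abundance (strings, as read with dtype=object)
toy_row = {
    "k__Bacteria": "100.0",
    "k__Bacteria|p__Firmicutes": "100.0",
    "k__Bacteria|p__Firmicutes|c__Clostridia": "100.0",
    "k__Bacteria|p__Proteobacteria": "0",
}

nodes = {}  # node name -> abundance
edges = []  # (parent, child) pairs

for lineage, abundance in toy_row.items():
    if float(abundance) > 0:  # skip taxa that are absent from the sample
        ranks = lineage.split("|")
        nodes[ranks[-1]] = float(abundance)
        if len(ranks) > 1:  # everything below the kingdom has a parent
            edges.append((ranks[-2], ranks[-1]))

print(sorted(nodes))
print(edges)
```

Only the last two entries of each lineage are needed for an edge, because the higher ranks were already linked when their own columns were processed.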

In [None]:
graph_dict = {} # dictionary of graphs per sample from the t2dmeta_long dataset
#n2v_dict={} # dictionary of vectors from graphs per sample
for i in list(t2dml_abundance_values_df.index): # one graph per sample
    G = nx.DiGraph()
    for tree_str, abundance_val in t2dml_abundance_values_df.loc[i].to_dict().items():
        if float(abundance_val) > 0:
            bacteria_list = tree_str.split('|')
            G.add_node(bacteria_list[-1],amount=abundance_val)
            if len(bacteria_list) > 1:
                G.add_edge(*bacteria_list[-2:]) # add single edge as tuple of two nodes
    graph_dict[i]=G
    #n2v_dict[i] = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)

Next we need to embed these graphs into an architecture that can be imported into a neural network. 

Ideas for the next step include converting each graph into a vector using the `Node2vec` constructor. See the documentation at: https://github.com/eliorc/node2vec. 

Another idea is to use PopPhy-CNN, a published architecture described at https://doi.org/10.1101/257931. The GitHub repo is at https://github.com/derekreiman/PopPhy-CNN.

## Building model with PyTorch

The following code was inspired by https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8

The *torch_geometric.data* module contains a [Data()](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/data/data.html#Data) class that we will use to create graphs for the abundance data. In our case, the Data() class takes three arguments:
1. **x**: a 2D tensor with one row per node. Each row holds the node's attributes: the integer-encoded bacteria ID (derived from the column name) and the abundance value. Abundance values are fractional, so only a floating-point tensor preserves them; an integer type such as `torch.long` cannot hold them.
2. **edge_index**: a torch.tensor() of data type `torch.long` with shape `[2, num_edges]`. The first row contains the parent ID of each edge and the second row contains the corresponding child bacteria ID.
3. **y**: a FloatTensor() that contains the target value for the sample
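
The layout of `x` and `edge_index` can be illustrated with plain NumPy arrays standing in for the tensors. The node names and abundance values below are made up, and the dictionary-based encoder is a simplified stand-in for `LabelEncoder`, which also assigns IDs in sorted order.

```python
import numpy as np

# Made-up sample: node name -> abundance, plus (parent, child) edges
toy_nodes = {"k__Bacteria": 100.0, "p__Firmicutes": 100.0, "c__Clostridia": 100.0}
toy_edges = [("k__Bacteria", "p__Firmicutes"), ("p__Firmicutes", "c__Clostridia")]

# Assign each node name an integer ID in sorted order
node_ids = {name: idx for idx, name in enumerate(sorted(toy_nodes))}

# x: one row per node, holding [node ID, abundance]
x = np.array([[node_ids[name], value] for name, value in toy_nodes.items()])

# edge_index: row 0 holds parent IDs, row 1 holds child IDs
edge_index = np.array([[node_ids[p], node_ids[c]] for p, c in toy_edges]).T

print(x.shape)           # (3, 2)
print(edge_index.shape)  # (2, 2)
```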

We use this Data() structure to build a custom dataset below:

In [None]:
class t2dmlDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(t2dmlDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []
    @property
    def processed_file_names(self):
        # file name relative to `<root>/processed/`
        return ['t2dmlDataset.dataset']

    def download(self):
        pass
    
    def process(self):
        samples_df = t2dml_samples_df
        abundance_df = t2dml_abundance_values_df
        data_list = []

        # get embedding for target values (y)
        le = preprocessing.LabelEncoder()
        y = le.fit_transform(samples_df['disease:n|t2d'])

        # get embedding of all nodes
        le_nodes = preprocessing.LabelEncoder()

        # encode labels between 0 and n_classes-1 for each bacterial label
        le_nodes.fit([col.split('|')[-1] for col in  abundance_df.columns])
        
        # process by each sample (row)
        row_range = range(len(abundance_df))
        for i in tqdm(row_range):
            node_list = [] # list of [$cur_bacteria_name,$abundance_val]
            edge_list = [] # list of [$parent_bacteria_name,$cur_bacteria_name]
            for key, val in abundance_df.iloc[i].to_dict().items():
                if float(val) > 0:
                    bacteria_list = key.split('|')
                    node = [le_nodes.transform([bacteria_list[-1]])[0],float(val)]
                    node_list.append(node)
                    if len(bacteria_list) >= 2:
                        edge_list.append(le_nodes.transform(bacteria_list[-2:]))
            # convert `y`, `node_list`, and `edge_list` into Tensor formats
            edge_array = np.array(edge_list)
            # shape [2, num_edges]: row 0 holds parent IDs, row 1 holds child IDs
            edge_index = torch.tensor(edge_array.T, dtype=torch.long)
            # float dtype keeps the fractional abundance values
            node_features = torch.tensor(np.array(node_list), dtype=torch.float)
            label = torch.FloatTensor([y[i]])
            # set these Tensors into a pytorch Data() object
            # which is used to model graphs
            data = Data(node_features,edge_index=edge_index,y=label)
            data_list.append(data)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

Here we can build the dataset.

In [None]:
%%time

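# `root` must be writable because the processed file is saved under it;
# ../input/ is read-only on Kaggle
dataset = t2dmlDataset('/kaggle/working/t2dmlDataset')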

After building the dataset, we call shuffle() to randomly reorder it and then split it into two sets for training and testing.

In [None]:
dataset = dataset.shuffle()
train_datalist = dataset[len(t2dml_abundance_values_df) // 10:]
test_datalist = dataset[:len(t2dml_abundance_values_df) // 10]

Alternatively, the function below builds the same list of graphs directly in memory, which may be faster.

In [None]:
# return a list of pytorch Data() objects which each contain a graph
# for each row in abundance_df
def graph_label(samples_df,abundance_df):
    
    # get embedding for target values (y)
    le = preprocessing.LabelEncoder()
    y = le.fit_transform(samples_df['disease:n|t2d'])
    print(y)
    
    # get embedding of all nodes
    le_nodes = preprocessing.LabelEncoder()
    
    # encode labels between 0 and n_classes-1 for each bacterial label
    le_nodes.fit([col.split('|')[-1] for col in  abundance_df.columns])
    data_list = []
    
    # process by each sample (row)
    for i in range(len(abundance_df)):
        node_list = [] # list of [$cur_bacteria_name,$abundance_val]
        edge_list = [] # list of [$parent_bacteria_name,$cur_bacteria_name]
        for key, val in abundance_df.iloc[i].to_dict().items():
            if float(val) > 0:
                bacteria_list = key.split('|')
                node = [le_nodes.transform([bacteria_list[-1]])[0],float(val)]
                node_list.append(node)
                if len(bacteria_list) >= 2:
                    edge_list.append(le_nodes.transform(bacteria_list[-2:]))
        # convert `y`, `node_list`, and `edge_list` into Tensor formats
        edge_array = np.array(edge_list)
        # shape [2, num_edges]: row 0 holds parent IDs, row 1 holds child IDs
        edge_index = torch.tensor(edge_array.T, dtype=torch.long)
        # float dtype keeps the fractional abundance values
        node_features = torch.tensor(np.array(node_list), dtype=torch.float)
        label = torch.FloatTensor([y[i]])
        # set these Tensors into a pytorch Data() object
        # which is used to model graphs
        data = Data(node_features,edge_index=edge_index,y=label)
        data_list.append(data)
    return data_list

We split the data so that 90% is in the training set and 10% is in the testing set. Then we use the DataLoader() class to batch these datasets for PyTorch.
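
The slicing used for the split can be checked on a plain list. This sketch assumes 50 made-up items and a fixed seed. Shuffling before the split is an assumption here; `graph_label` returns the samples in their original order.

```python
import random

# 50 made-up items standing in for the per-sample graphs
items = list(range(50))

# Shuffle with a fixed seed so the split is reproducible
random.Random(0).shuffle(items)

split = len(items) // 10      # size of the smaller part (10%)
train_items = items[split:]   # everything after the first tenth (90%)
test_items = items[:split]    # the first tenth (10%)

print(len(train_items), len(test_items))  # 45 5
```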

In [None]:
data_list = graph_label(t2dml_samples_df,t2dml_abundance_values_df)
train_datalist = data_list[len(data_list) // 10:]
test_datalist = data_list[:len(data_list) // 10]

train_loader = DataLoader(train_datalist, batch_size=32, shuffle=True)
test_loader = DataLoader(test_datalist, batch_size=4)

Now that we loaded the data, it is time to run some models! See this article for more info: https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8

## Cleaning Marker Presence file
Once we have created a model using the abundance data, we can try to add the marker-presence information into the model. However, for now we will put this on hold. Ignore the code below.

In [None]:
%%time 
# Import this file without the first 211 columns (since we already dealt with those previously). 
# This file could be imported as a sparse numpy matrix. it is very large.
if 1==0:
    markers_reader = pd.read_csv(
            marker_presence_table_file,
            sep=",", 
            dtype=object,
            usecols=range(211,288558),
            nrows=10)
    print(markers_reader.head())