# Introduction
# Goal: classify diseases using the obtained abundance features and determine the best ML models for this task
In this notebook we explore metagenomics data. This dataset was created by the team of Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata; they published [a research article in July of 2016](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004977). The authors used 8 publicly available metagenomic datasets, and applied [MetaPhlAn2](https://github.com/segatalab/metaml#metaml---metagenomic-prediction-analysis-based-on-machine-learning) to generate species abundance features.

## Logistics behind the Input Data

This notebook was created to further explore the metagenomics data on Kaggle. The dataset is available at https://www.kaggle.com/antaresnyc/metagenomics and includes:
* abundance.txt: a table containing the abundances of each organism type
  * the first 210 features include meta-data about the samples
  * the rest of the features include the abundance data in float-type
* marker_presence.txt: a table containing the presence of strain-specific markers. 
  * the first 210 features include meta-data about the samples (same as abundance.txt)
  * In a previous notebook I converted the marker presence feature data into a sparse matrix for easier downloading. This sparse matrix is found on [kaggle](https://www.kaggle.com/sklasfeld/metagenomics-marker-presence-sparse-matrix).
* markers2clades_DB.txt: a lookup table to associate each marker identifier to the corresponding species.

In summary, every sample is described by 210 metadata columns and by the abundance of each organism found in it. If an organism is present in a sample, we also have strain-specific marker information for it.
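
Presence/absence tables like `marker_presence.txt` tend to be mostly zeros, which is why a sparse format is convenient. Here is a minimal sketch of the round trip with `scipy.sparse`. The toy matrix and the temporary file name are made up for illustration, and I am only assuming that the real `.npz` file can be read the same way.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy presence matrix: rows are samples, columns are markers (made-up values)
dense = np.array([[0, 1, 0, 0],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]], dtype=np.int8)

# Compressed sparse row format stores only the non-zero entries
csr = sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example
    path = os.path.join(tmp_dir, "toy_marker_presence.npz")
    sparse.save_npz(path, csr)
    restored = sparse.load_npz(path)

print(restored.shape, restored.nnz)
```

Calling `restored.toarray()` gives back the dense table when a regular NumPy array is needed.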

## Libraries
Below I import some libraries that may be useful and then print the input files.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy

# plot with matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#from plotnine import * # used to plot data

# progress bar
from tqdm import tqdm

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
marker_presence_matrix_file="/kaggle/input/metagenomics-marker-presence-sparse-matrix/marker_presence_matrix.npz"
markers2clades_DB_file="/kaggle/input/human-metagenomics/markers2clades_DB.csv"
abundance_file="/kaggle/input/human-metagenomics/abundance.csv"
marker_presence_table_file="/kaggle/input/human-metagenomics/marker_presence.csv"

# Cleaning the Data
The marker matrix is dependent on the abundance table in that strain-specific markers can only appear if a specific strain is abundant. Both tables can be merged together using a join-function on the 210 sample meta-data columns. However these columns are very messy. Therefore let's clean them before we move on to understanding the rest of the data.

## Testing the meta data

The meta data information is given in both the marker_presence and abundance tables. I just wanted to make sure they contain the same information.

In [None]:
%%time

samples_df = pd.read_csv(abundance_file,
                         sep=",", dtype=object,usecols=range(0,210))

In [None]:
%%time
# if 1 == 0:
samples_df2 = pd.read_csv(marker_presence_table_file,
                         sep=",", dtype=object,usecols=range(0,210))

In [None]:
# if 1 == 0:
samples_df.compare(samples_df2, align_axis=0)

It looks like they are essentially the same, so I can move forward using `samples_df`.
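
To show what `compare` reports, here is a small sketch on two toy tables (the values are made up). An empty result means the tables match, and `equals` gives the same answer as a single boolean.

```python
import pandas as pd

# Two toy metadata tables with made-up values
meta_a = pd.DataFrame({"sampleID": ["s1", "s2"],
                       "bodysite": ["stool", "skin"]}, dtype=object)
meta_b = meta_a.copy()

# Identical tables: compare() returns an empty frame, equals() returns True
diff_same = meta_a.compare(meta_b, align_axis=0)
identical = meta_a.equals(meta_b)

# Change one cell and compare again: only the differing column is reported
meta_b.loc[1, "bodysite"] = "stool"
diff_changed = meta_a.compare(meta_b, align_axis=0)

print(identical)
print(diff_changed)
```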

In [None]:
samples_df.describe()

## Cleaning meta features

Remove all columns with only one value.

In [None]:
samples_df = samples_df.loc[:, samples_df.nunique() > 1].copy()

Next I look at categorical columns (i.e., any feature that has fewer than 20 distinct values).

In [None]:

for col in samples_df.loc[:, samples_df.nunique() < 20]:
    print("%s:%i" % (col,samples_df[col].nunique()))
    print(samples_df[col].unique())
    print("")

It looks like `nd`, `na`, `unknown`, and `-` all stand for no data. Therefore, let's replace all of these values with `np.nan`.
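
An alternative, sketched here on a made-up CSV snippet, is to declare these markers as missing while reading the file, using the `na_values` argument of `pd.read_csv`. This is only an illustration and not what this notebook does.

```python
import io

import pandas as pd

# Made-up CSV snippet that uses the same "no data" markers
csv_text = (
    "sampleID,gender,smoker\n"
    "s1,female,nd\n"
    "s2,-,yes\n"
    "s3,male,unknown\n"
)

missing_markers = ["nd", "na", "-", " -", "unknown"]

# na_values adds these strings to the values that pandas parses as NaN
toy_df = pd.read_csv(io.StringIO(csv_text), dtype=object,
                     na_values=missing_markers)

print(toy_df.isna().sum())
```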

In [None]:
# np.nan is used instead of np.NaN because the capitalised alias was removed in NumPy 2.0
samples_df = samples_df.replace(["nd", "na", "-", " -", "unknown"], np.nan)

We can remove all columns that contain only one distinct value besides NaN. These do not seem to be very informative anyway.

In [None]:
# change the if statement to visualize
# if 1==0:
for col in samples_df.loc[:, samples_df.nunique() == 1].columns:
    samples_df[col].fillna("NaN").value_counts().sort_values().plot(
        kind = 'bar', title=col)
    plt.show()

samples_df = samples_df.loc[:, samples_df.nunique() > 1].copy()

I want to encode some columns numerically. For example, if the values are either:
* "yes", "no", or null
* "y", "n", or null
* "positive", "negative", or null
* "a" (affected), "u" (unaffected), or null

I want to convert them into `2`, `1`, and `0` respectively.
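
The same encoding can be written as a single lookup table. The sketch below uses a made-up toy column and a hypothetical `encode_binary` helper. Unlike the loop in this notebook, it also maps any unrecognized value to `0`, so it should only be applied to columns already known to hold these labels.

```python
import numpy as np
import pandas as pd

# Lookup table for the 2 / 1 / 0 scheme described above
LABEL_CODES = {
    "yes": 2, "y": 2, "positive": 2, "a": 2,
    "no": 1, "n": 1, "negative": 1, "u": 1,
}

def encode_binary(series):
    """Map positive labels to 2, negative labels to 1, everything else to 0."""
    return series.map(LABEL_CODES).fillna(0).astype(int)

# Toy column with one missing value
toy_col = pd.Series(["yes", "no", np.nan, "yes"], dtype=object)
encoded = encode_binary(toy_col)
print(encoded.tolist())
```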

In [None]:
bool_vals={'True':2,
          'False':1,
          'Null':0}

for col in samples_df.loc[:, samples_df.nunique() < 4]:
    if ("yes" in samples_df[col].unique() and "no" in samples_df[col].unique()):
            samples_df[col] = samples_df[col].fillna(bool_vals['Null'])
            samples_df =samples_df.replace({col: {'yes': bool_vals['True'], 'no': bool_vals['False']}})
    elif ("y" in samples_df[col].unique() and "n" in samples_df[col].unique()):
            samples_df[col] = samples_df[col].fillna(bool_vals['Null'])
            samples_df =samples_df.replace({col: {'y': bool_vals['True'], 'n': bool_vals['False']}})
    elif ("positive" in samples_df[col].unique() and "negative" in samples_df[col].unique()):
            samples_df[col] = samples_df[col].fillna(bool_vals['Null'])
            samples_df =samples_df.replace({col: {'positive': bool_vals['True'], 'negative': bool_vals['False']}})
    elif ("a" in samples_df[col].unique() and "u" in samples_df[col].unique()):
            samples_df[col] = samples_df[col].fillna(bool_vals['Null'])
            samples_df =samples_df.replace({col: {'a': bool_vals['True'], 'u': bool_vals['False']}})

Similarly, for columns that contain 2 values (not including null) I will convert the values to numbers. For example, I will change the column named "gender" to "gender:Female|Male". The values will be 1 for Female, 2 for Male, and 0 for null.
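
As a compact illustration of this encoding, `pd.factorize` numbers the values in order of first appearance and marks nulls with `-1`, so adding 1 gives the `0`/`1`/`2` scheme. The toy column below is made up, and this is a sketch rather than the code this notebook runs.

```python
import numpy as np
import pandas as pd

# Toy two-valued column with a missing entry
toy_col = pd.Series(["female", np.nan, "male", "female"],
                    dtype=object, name="gender")

# codes: 0 for the first value seen, 1 for the second, -1 for nulls
codes, uniques = pd.factorize(toy_col)

# Shift by one so that nulls become 0, and record the value order in the name
new_name = "%s:%s" % (toy_col.name, "|".join(uniques))
encoded = pd.Series(codes + 1, index=toy_col.index, name=new_name)

print(encoded.name, encoded.tolist())
```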

In [None]:
for col in samples_df.loc[:, samples_df.nunique() == 2].columns:
    if (not(True in samples_df[col].unique() and 
             False in samples_df[col].unique())):
        # nunique() ignores nulls, so two non-null values remain;
        # unique() lists them in order of first appearance
        first_val, second_val = samples_df[col].dropna().unique()[:2]
        new_col_name=("%s:%s|%s" % (col,first_val, second_val))
        # change the column name
        samples_df = (samples_df.rename(
            columns={col:new_col_name}))
        # change values in the column
        samples_df[new_col_name] = samples_df[new_col_name].fillna(bool_vals['Null'])
        samples_df =samples_df.replace({new_col_name: {first_val: bool_vals['False'],
                                                       second_val: bool_vals['True']}})
categorical_cols=samples_df.loc[:, samples_df.nunique() < 20].columns

It was brought to my attention that most samples come from stool. Therefore it makes sense that we remove other types of samples.

In [None]:
samples_df['bodysite'].value_counts().plot(kind='bar')

In [None]:
samples_df['bodysite'] == 'stool'
print(np.sum(samples_df.nunique() < 3))

Unfortunately, this didn't help remove any features from the metadata.
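
Note that the comparison `samples_df['bodysite'] == 'stool'` by itself does not filter the table, so the count above is taken over all samples. A sketch of the same kind of check restricted to stool samples, using a made-up toy table and a hypothetical `constant_cols` variable:

```python
import pandas as pd

# Toy metadata table with made-up values
toy_meta = pd.DataFrame({
    "bodysite": ["stool", "stool", "skin"],
    "country": ["it", "it", "us"],
    "age": ["30", "41", "30"],
}, dtype=object)

# Apply the mask first, then look for columns that no longer vary
stool_only = toy_meta.loc[toy_meta["bodysite"] == "stool"]
constant_cols = stool_only.columns[stool_only.nunique() <= 1].tolist()

print(constant_cols)
```

Columns listed in `constant_cols` carry no information once only stool samples are kept, so they would be candidates for removal.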

In [None]:
stool_samp_df = samples_df.loc[samples_df['bodysite'] == 'stool',:].copy()


## Cleaning abundance file

Import the full abundance file.
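
Because the file is read with `dtype=object`, the abundance values arrive as strings rather than floats. Before modelling they would need to be converted. A minimal sketch with `pd.to_numeric` follows. The column names and values are made up, and `errors="coerce"` (which turns unparseable entries into NaN) is my own choice here.

```python
import pandas as pd

# Toy table: one metadata column and two abundance columns stored as strings
toy_abundance = pd.DataFrame({
    "sampleID": ["s1", "s2"],
    "k__Bacteria": ["99.5", "100"],
    "k__Archaea": ["0.5", "0"],
}, dtype=object)

# Hypothetical list of abundance columns for this example
abundance_cols = ["k__Bacteria", "k__Archaea"]

# Convert each abundance column to floats; bad entries would become NaN
numeric_abundance = toy_abundance[abundance_cols].apply(pd.to_numeric,
                                                        errors="coerce")

print(numeric_abundance.dtypes)
```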

In [None]:
from pandas import DataFrame
from IPython.display import HTML

In [None]:
%%time

abundance_df = (pd.read_csv(abundance_file,sep=",", dtype=object)
               .iloc[:,211:])
abundance_df = samples_df.merge(abundance_df, how='left',
                               left_index=True, right_index=True)

In [None]:
abundance_df.head()

In [None]:
abundance_df.describe()

In [None]:
def show_missing(df):
    """
    Return the total missing values and the percentage of
    missing values by column.
    """
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})

In [None]:
show_missing(abundance_df)

There are 502 datapoints with no disease information listed. Since the goal is to predict disease state based on abundance features, we should remove these from the dataset.
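
For reference, the same kind of filter can be written with `dropna(subset=...)`. The sketch below runs on a made-up toy table, not on the real data.

```python
import numpy as np
import pandas as pd

# Toy table in which one sample has no disease label
toy_samples = pd.DataFrame({
    "sampleID": ["s1", "s2", "s3"],
    "disease": ["obesity", np.nan, "n"],
})

# Keep only the rows where the target column is present
labelled = toy_samples.dropna(subset=["disease"]).copy()

print(len(toy_samples), "->", len(labelled))
```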

In [None]:
abundance_df = abundance_df[~abundance_df.disease.isnull()].copy()
missing = show_missing(abundance_df)
HTML(DataFrame(missing).to_html())

In [None]:
for col in abundance_df:
    print(col)

In [None]:
# Share of samples in each disease category
disease_share = abundance_df.disease.value_counts(normalize=True)
print(disease_share)

# Horizontal bar chart of those shares
disease_share.plot.barh()
plt.show()

* `n` = normal?
* `n_relative` = ?
* `obesity` and `obese` are separate categories. Unsure why.
* Unsure what `y` means.


In [None]:
# Share of samples from each body site
bodysite_share = abundance_df.bodysite.value_counts(normalize=True)
print(bodysite_share)

# Horizontal bar chart of those shares
bodysite_share.plot.barh()
plt.show()

Is there a correlation between bodysite and disease?
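
A contingency table is a simple numeric complement to the plots. The sketch below uses `pd.crosstab` on a made-up toy table. The labels, including `t2d`, are hypothetical and only serve the example.

```python
import pandas as pd

# Toy table with made-up body sites and disease labels
toy_samples = pd.DataFrame({
    "bodysite": ["stool", "stool", "skin", "skin", "stool"],
    "disease": ["n", "obesity", "n", "n", "t2d"],
})

# Raw counts of each disease label per body site
counts = pd.crosstab(toy_samples["bodysite"], toy_samples["disease"])

# Same table as row-wise shares, so that each body site sums to 1
row_share = pd.crosstab(toy_samples["bodysite"], toy_samples["disease"],
                        normalize="index")

print(counts)
print(row_share)
```

A body site whose row is concentrated in a single disease label contributes little to telling diseases apart.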

In [None]:
import seaborn as sns
sns.displot(abundance_df, y='bodysite', hue='disease', multiple="stack", stat="density")

In [None]:
nonstool_df = abundance_df.loc[abundance_df["bodysite"] != "stool"]

In [None]:
sns.displot(nonstool_df, y='bodysite', hue='disease', multiple="stack", stat="density")

I recommend excluding all non-stool data. The vast majority of non-stool samples represent only the normal state and the `y` state (whatever that means).

In [None]:
abundance_df = abundance_df.loc[abundance_df["bodysite"] == "stool"]

In [None]:
sns.displot(abundance_df, y='disease', hue='dataset_name', multiple="stack", stat="density")

In [None]:
# Weight-related disease labels, spelled as they appear in the data
weight_labels = ["overweight", "obese", "obesity", "underweight", "leaness"]
obese_df = abundance_df.loc[abundance_df["disease"].isin(weight_labels)]

obese_df.groupby(["dataset_name", "disease"]).count()[["sampleID", "bmi"]]

Clean up duplicated columns.
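
One vectorized way to find identical columns is to transpose the table and call `duplicated()`, which flags every column that repeats an earlier one. This is a sketch on made-up toy data. Transposing a very wide table can be memory-hungry, so it may not suit the full dataset.

```python
import pandas as pd

# Toy abundance table in which column "b" repeats column "a"
toy_abundance = pd.DataFrame({
    "a": [0.1, 0.0, 0.3],
    "b": [0.1, 0.0, 0.3],
    "c": [0.0, 0.2, 0.0],
})

# After transposing, duplicated rows correspond to duplicated columns
dup_mask = toy_abundance.T.duplicated()
duplicate_names = dup_mask[dup_mask].index.tolist()

# Keep the first occurrence of each distinct column
deduped = toy_abundance.loc[:, ~dup_mask]

print(duplicate_names)
print(list(deduped.columns))
```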


In [None]:
# Create a subset of only the bacteria columns
sub_abun_df = abundance_df.iloc[:,203:]
bacteria_names = list(sub_abun_df.columns)

In [None]:
def getDuplicateColumns(df):
    """Return the names of columns whose contents repeat an earlier column."""
    duplicateColumnNames = set()
    # Iterate over all the columns in the DataFrame
    for x in range(df.shape[1]):
        # Select column at xth index.
        col = df.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index.
            otherCol = df.iloc[:, y]
            # Check if the two columns at index x and y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

In [None]:
abundance_df = abundance_df.drop(columns=getDuplicateColumns(abundance_df))

In [None]:
# Add a column that sums the number of different bacterial species
# Average abundance ?

#Maintain column names if column name == another column name
# New column name = column_name + AGG
# .sum()
# .count()
# .mean()
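
As a starting point for the ideas noted above, here is a rough sketch of two per-sample summary features: the number of taxa present and the mean abundance. The column names, values, and the feature names `n_taxa_present` and `mean_abundance` are all made up for illustration.

```python
import pandas as pd

# Toy abundance table: rows are samples, columns are hypothetical species
toy_abundance = pd.DataFrame({
    "s__Species_a": [0.5, 0.0, 1.0],
    "s__Species_b": [0.0, 0.0, 0.5],
})

summary = pd.DataFrame({
    # Number of taxa with non-zero abundance in each sample
    "n_taxa_present": (toy_abundance > 0).sum(axis=1),
    # Mean abundance across all taxa in each sample
    "mean_abundance": toy_abundance.mean(axis=1),
})

print(summary)
```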