# DREAM Challenge: FINRISK - Heart Failure and Microbiome

### Overview

<div style="text-align: justify"> Cardiovascular diseases are the leading cause of death in both men and women worldwide. Heart failure (HF) is the most common form of heart disease, characterised by the heart's inability to pump enough blood to meet the body's needs. The lifetime risk of developing HF is roughly 20%, yet it remains difficult to diagnose, in part because of a lack of agreement on diagnostic criteria. Because the diagnosis of HF depends on ascertaining clinical histories and appropriately screening symptomatic individuals, identifying those at risk of HF is essential. <br> <br> This DREAM challenge focuses on predicting heart failure from a combination of gut microbiome and clinical variables. </div>

<img src="metadata.png"> <br> **Fig. 1** Description of the metadata variables. <br> <br> *Negative values indicate that heart failure occurred in the participant before baseline.

 </img>

------------------------------

# Exploring data

In [26]:
import pandas as pd

# CSV files in the train folder
pheno_train = r"../train/pheno_training.csv"
read_counts = r"../train/readcounts_training.csv"
taxtable = r"../train/taxtable.csv"


Host phenotype data (pheno_training.csv): individuals in rows and metadata variables in columns.

In [27]:
pheno_train_df = pd.read_csv(pheno_train)
pheno_train_df.rename(columns={"Unnamed: 0": "Sample_ID"}, inplace=True)  # rename the first column to Sample_ID
pheno_train_df["Sample_ID"] = pheno_train_df["Sample_ID"].str.replace("Simulated_", "")  # strip the "Simulated_" prefix, leaving the numeric ID (still stored as a string)
pheno_train_df.head()

| | Sample_ID | Age | BodyMassIndex | Smoking | BPTreatment | PrevalentDiabetes | PrevalentCHD | PrevalentHFAIL | Event | Event_time | SystolicBP | NonHDLcholesterol | Sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 328 | 53.618 | 24.127 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.75 | 133.077 | 3.02 | 0 |
| 1 | 1644 | 36.811 | 27.992 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.881 | 108.914 | 5.48 | 0 |
| 2 | 1710 | 49.429 | 23.664 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.891 | 110.064 | 4.388 | 1 |
| 3 | 1732 | 48.842 | 26.804 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.918 | 128.059 | 5.119 | 0 |
| 4 | 1727 | 60.738 | 29.862 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.841 | 169.913 | 5.74 | 1 |
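
Since the note under Fig. 1 says that negative `Event_time` values mark heart failure that occurred before baseline, one may want to set those participants aside before modelling incident HF. The following is a minimal sketch under that assumption, not necessarily the filtering the challenge intends. The rows in `toy_pheno` and the helper `drop_prevalent_cases` are hypothetical; only the column names come from the phenotype table.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the phenotype table.
toy_pheno = pd.DataFrame(
    {
        "Sample_ID": ["1", "2", "3", "4"],
        "PrevalentHFAIL": [0.0, 1.0, 0.0, 0.0],
        "Event": [0.0, 1.0, 1.0, None],
        "Event_time": [15.8, -2.1, 7.4, 15.9],
    }
)


def drop_prevalent_cases(pheno: pd.DataFrame) -> pd.DataFrame:
    """Keep participants without heart failure at or before baseline."""
    # A participant counts as prevalent if flagged explicitly or if the
    # event time is negative (event happened before baseline).
    prevalent = (pheno["PrevalentHFAIL"] == 1) | (pheno["Event_time"] < 0)
    return pheno.loc[~prevalent].reset_index(drop=True)


incident_pheno = drop_prevalent_cases(toy_pheno)
print(incident_pheno)

# Count outcomes, including missing ones, before choosing a model.
print(incident_pheno["Event"].value_counts(dropna=False))
```

On the real data the same call would be `drop_prevalent_cases(pheno_train_df)`. Whether rows with a missing `Event` should also be dropped is a separate modelling decision that this sketch leaves open.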


Taxonomic abundance table (readcounts_training.csv): individuals in columns and taxon names in rows.

In [28]:
read_counts_df = pd.read_csv(read_counts)
read_counts_df.head()

| | Unnamed: 0 | Simulated_328 | Simulated_1644 | Simulated_1710 | Simulated_1732 | Simulated_1727 | Simulated_2196 | Simulated_1681 | Simulated_1651 | Simulated_1603 | ... | Simulated_1676 | Simulated_1630 | Simulated_1605 | Simulated_2202 | Simulated_1682 | Simulated_1783 | Simulated_3425 | Simulated_1789 | Simulated_1592 | Simulated_1731 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `k__Archaea;p__;c__;o__;f__;g__;s__` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | `k__Archaea;p__Candidatus_Korarchaeota;c__;o__;...` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | `k__Archaea;p__Crenarchaeota;c__Thermoprotei;o_...` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | `k__Archaea;p__Crenarchaeota;c__Thermoprotei;o_...` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | `k__Archaea;p__Crenarchaeota;c__Thermoprotei;o_...` | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
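
Because the read counts have individuals in columns while the phenotype table has them in rows, the two cannot be combined directly. One possible way to line them up is to transpose the counts and strip the `Simulated_` prefix, as sketched below. The toy values and the helper `to_samples_by_taxa` are hypothetical; the sketch assumes the taxon names sit in the `Unnamed: 0` column, as in the output above.

```python
import pandas as pd

# Hypothetical toy counts; the layout mirrors the read-count table
# (taxa in rows, one "Simulated_<id>" column per individual).
toy_counts = pd.DataFrame(
    {
        "Unnamed: 0": ["k__Bacteria;p__Firmicutes", "k__Bacteria;p__Bacteroidetes"],
        "Simulated_328": [10, 30],
        "Simulated_1644": [0, 5],
    }
)


def to_samples_by_taxa(counts: pd.DataFrame, taxon_col: str = "Unnamed: 0") -> pd.DataFrame:
    """Transpose so each row is an individual and each column a taxon."""
    wide = counts.set_index(taxon_col).T
    # Match the Sample_ID format used for the phenotype data.
    wide.index = wide.index.str.replace("Simulated_", "", regex=False)
    wide.index.name = "Sample_ID"
    wide.columns.name = None
    return wide


samples_by_taxa = to_samples_by_taxa(toy_counts)

# Relative abundance: divide each row by that individual's total read count.
# An individual with zero total reads would yield NaN here.
rel_abundance = samples_by_taxa.div(samples_by_taxa.sum(axis=1), axis=0)
print(rel_abundance)
```

With both tables indexed by the same string IDs, something like `pheno_train_df.set_index("Sample_ID").join(to_samples_by_taxa(read_counts_df))` should then align clinical and microbiome features per individual.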


Taxonomic mapping table (taxtable.csv): maps each taxon to its full taxonomic hierarchy: Kingdom (the `Domain` column), Phylum, Class, Order, Family, Genus and Species.

In [29]:
taxtable_df = pd.read_csv(taxtable)
taxtable_df.head()

| | Domain | Phylum | Class | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|---|
| 0 | `k__Archaea` | `p__` | `c__` | `o__` | `f__` | `g__` | `s__` |
| 1 | `k__Archaea` | `p__Candidatus_Korarchaeota` | `c__` | `o__` | `f__` | `g__` | `s__` |
| 2 | `k__Archaea` | `p__Crenarchaeota` | `c__Thermoprotei` | `o__` | `f__` | `g__` | `s__` |
| 3 | `k__Archaea` | `p__Crenarchaeota` | `c__Thermoprotei` | `o__Acidilobales` | `f__Acidilobaceae` | `g__Acidilobus` | `s__Acidilobus_saccharovorans` |
| 4 | `k__Archaea` | `p__Crenarchaeota` | `c__Thermoprotei` | `o__Acidilobales` | `f__Caldisphaeraceae` | `g__Caldisphaera` | `s__Caldisphaera_lagunensis` |
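
The rank columns make it possible to collapse species-level counts to a coarser level such as phylum, which reduces the number of sparse features. Below is a minimal sketch of that aggregation. It assumes the read-count row labels are complete semicolon-separated lineage strings in the same rank order as the mapping table; the toy counts and the helpers `split_lineage` and `aggregate_by_rank` are hypothetical.

```python
import pandas as pd

RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]

# Hypothetical toy counts; taxa in rows, individuals in columns.
toy_counts = pd.DataFrame(
    {
        "Simulated_1": [4, 6, 1],
        "Simulated_2": [0, 3, 2],
    },
    index=[
        "k__Bacteria;p__Firmicutes;c__Bacilli;o__;f__;g__;s__",
        "k__Bacteria;p__Firmicutes;c__Clostridia;o__;f__;g__;s__",
        "k__Archaea;p__Crenarchaeota;c__Thermoprotei;o__;f__;g__;s__",
    ],
)


def split_lineage(lineages: pd.Index) -> pd.DataFrame:
    """Split 'k__..;p__..;...' strings into one column per rank."""
    parts = lineages.to_series().str.split(";", expand=True)
    parts.columns = RANKS[: parts.shape[1]]
    return parts


def aggregate_by_rank(counts: pd.DataFrame, rank: str) -> pd.DataFrame:
    """Sum read counts over all taxa that share the same label at `rank`."""
    labels = split_lineage(counts.index)[rank]
    return counts.groupby(labels).sum()


phylum_counts = aggregate_by_rank(toy_counts, "Phylum")
print(phylum_counts)
```

Taxa with an unresolved rank (an empty label such as `p__`) end up in one shared group under this scheme, so it may be worth treating that group as "unclassified" rather than as a real phylum.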
