# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [202]:
import pandas as pd
import os
import numpy as np

In [203]:
#DATA_FOLDER = "/home/vinz/Desktop/ADA/ADA2017-Tutorials/02 - Intro to Pandas/Data" # Use the data folder provided in Tutorial 02 - Intro to Pandas.
DATA_FOLDER = "./Data"

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `Dataframe`.
Using this `DataFrame`, calculate for *each country*, the *daily average* per year of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

In [204]:
# Setup for location of the dataset of the task 2
MICROBIOME_FOLDER = DATA_FOLDER + "/microbiome"


### Basic analysis of the file formatting:

***
**For the files MIDn.xls with n in [1,9]**

**column 1:**

We see that the first column in the files contain the scientific classification of the microbiomes
Altough it could be kept as a single string, it would have more meaning if splitted
The scientific classification contains the following subdivisions (https://en.wikipedia.org/wiki/Taxonomic_rank): 

       Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species
       
 With this classification, we have a problem: only 6 strings are given in the data set, whilst we have 8 potential divisions in the classification.
 
 First, we note that in the scientific classification, Kingdom is not used with Bacterias (https://en.wikipedia.org/wiki/Bacteria) and in Archeas, it is always the same as the Phylum (https://en.wikipedia.org/wiki/Archaea). After checking the strings in the data set, it indeed seems that the Kingdom is never given. Therefore, **we will not use Kingdom as a division**.
 
 Also, Species only has a sense in the Eucaryote Domain, for which we have no data, so **we will not use Species as a division**.
 
 In addition, after playing with the data, we found cases where the Family can be named: "Incertae Sedis", which would be classified as the family and genus in our classification. Therefore, we need to check for those cases and re-concatenate the strings to return a proper list in every cases.

In [205]:
# Lets keep those names in a list for further use
scientific_classification = ["domain", "phylum", "class", "order", "family", "genus"]

**Column 2:**

We see that the second column contains an integer value. We can suppose that this value is the number of samples
containing the genus defined by the first column. There's no title to the column, so we don't know yet where
those values came from.

***

**For the file "Metadata.xls"**

**Column 1:**

Titled **"BARCODE"**, give xls file identifier for which the two other columns give more information

**Column 2:**

Titled **"GROUP"**, gives the groups from which each dataset has been sampled. Two informations are contained in those groups, the first is given by "NEC", "Control" or "EXTRACTION CONTROL" and the second is the numbering of the group (which is likely a phase of test), either "1" or "2". We will split those informations in two columns, because while using the dataset, we might want to combine all the "NEC" patients or all the patient for a specific phase.

**Column 3:**

Titled **"SAMPLE"**, gives the type of sample that was taken, either tissue, stool or NA. Each group had both types of samples taken. 

***

**PROOFREADING** Do you think control and extraction control might be the same thing? No, actually extraction control seems to be a test of the equipment done to validate that all types of samples can be extracted. Then, NEC vs control is likely a case where a treatment is given on the NEC subgroup and placebo/no treatment is given to the Control group.

### Desired formating of the data after analysis

The simple analysis above tells us what are the columns that we will want in our Data Frame

1. **6** column for classification, 1 for each classifier in the "scientific_classification" list. This will caracterise each microbiome individually.

2. **1** column will contain the value associated with each microbiome measurement.

3. **2** columns describing the group of the sample from the metadata. The first column will be called **Group Type** and will contain the "NEC", "Control" or "EXTRACTION CONTROL" value. The second column will be called **Group Phase** and will contain either "1", "2" or "unknow", "unknow" will be used for the "EXCTRACTON CONTROL" group.

4. **1** column describing the type of sample taken from the metadata. This column will be called **Sample** This will be either "tissue", "stool" or "unknow", "unknow" in the case of the EXTRACTION CONTROL group.

For a total of **10** columns


In [206]:
metadata_list = ["group_type", "group_phase", "sample"]

In [207]:
df_col_list = scientific_classification + metadata_list + ["value"]

In [208]:
print(len(df_col_list))
print(df_col_list)

10
['domain', 'phylum', 'class', 'order', 'family', 'genus', 'group_type', 'group_phase', 'sample', 'value']


***
### We want to extract the metadata from the metadata.xls file

The metadata is needed before we start extracting the data from the MB files to create the DataFrame with all the desired columns.


In [209]:
#The actual name of the excel sheet is "Sheet1" and not "Sheet 1" as in the other files.
metadata_raw = pd.read_excel(MICROBIOME_FOLDER+"/metadata.xls", sheetname='Sheet1', header=0)
metadata_raw.columns = metadata_raw.columns.str.lower()

In [210]:
## Extract group phase and group type
groups = metadata_raw["group"]
group_type = []
group_phase = []
for group in groups:
    # A special case for the extraction control, we don't want to split it
    if group == "EXTRACTION CONTROL":
        group_type.append(group)
        group_phase.append("")
    else:
        type, phase, *_ = group.split()
        group_type.append(type)
        group_phase.append(phase)

In [211]:
metadata_raw['group_type'] = pd.Series(group_type)
metadata_raw['group_phase'] = pd.Series(group_phase)
metadata = metadata_raw.drop("group",  axis=1)
metadata

Unnamed: 0,barcode,sample,group_type,group_phase
0,MID1,,EXTRACTION CONTROL,
1,MID2,tissue,NEC,1.0
2,MID3,tissue,Control,1.0
3,MID4,tissue,NEC,2.0
4,MID5,tissue,Control,2.0
5,MID6,stool,NEC,1.0
6,MID7,stool,Control,1.0
7,MID8,stool,NEC,2.0
8,MID9,stool,Control,2.0


In [212]:
# Make sure SAMPLE doesn't contain NaN, replace it by an empty string
# This will give use prettier column names later on
metadata["sample"].fillna(value="", inplace=True)
metadata

Unnamed: 0,barcode,sample,group_type,group_phase
0,MID1,,EXTRACTION CONTROL,
1,MID2,tissue,NEC,1.0
2,MID3,tissue,Control,1.0
3,MID4,tissue,NEC,2.0
4,MID5,tissue,Control,2.0
5,MID6,stool,NEC,1.0
6,MID7,stool,Control,1.0
7,MID8,stool,NEC,2.0
8,MID9,stool,Control,2.0


The format of the metadata is now as we want it

***
### We now want to extract the data from the datasheets

** We create a function to extract the classifications **

We want to extract the classifications from the first column of the datasheets. We will need to do string parsing. Regex are a great tool to remove the unwanted characters from the format. The only unwanted characters are the double quotes ". Also, as said before, we need to manage the case where we have the name "Incertae Sedis" as a Family (4th position)

In [213]:
import re

def get_classifiers(classifier_string):
    """Replaces " in the classifier_string and 
    splits the string to have an indexable list
    """
    classified_list = re.subn("\"", "", classifier_string)[0].lower().split()
    
    # Special case management
    if len(classified_list) > 6:
        # We join the first extra location with the family
        classified_list[4] = ' '.join([classified_list[4], classified_list[5]])
        del classified_list[5]
        
        # Allow for harbitrary number of words in the genus
        classified_list[5] = ' '.join(classified_list[6:])
        del classified_list[6:]
        
    return classified_list

** We can now attempt to create the desired dataframe **

In [214]:
# In this section, we test the analysis of a single datasheet

test_data = pd.read_excel(MICROBIOME_FOLDER+"/MID1.xls", sheetname='Sheet 1', header=None)
test_data.columns = ["raw_classification", "value"]
classifier_df = pd.DataFrame(columns=df_col_list)
classifier_array = [get_classifiers(row.raw_classification) for row in test_data.itertuples()]
classifier_series = pd.DataFrame(classifier_array, columns=scientific_classification)
classifier_series["value"] = test_data["value"]
classifier_series

Unnamed: 0,domain,phylum,class,order,family,genus,value
0,archaea,crenarchaeota,thermoprotei,desulfurococcales,desulfurococcaceae,ignisphaera,7
1,archaea,crenarchaeota,thermoprotei,desulfurococcales,pyrodictiaceae,pyrolobus,2
2,archaea,crenarchaeota,thermoprotei,sulfolobales,sulfolobaceae,stygiolobus,3
3,archaea,crenarchaeota,thermoprotei,thermoproteales,thermofilaceae,thermofilum,3
4,archaea,euryarchaeota,methanomicrobia,methanocellales,methanocellaceae,methanocella,7
5,archaea,euryarchaeota,methanomicrobia,methanosarcinales,methanosarcinaceae,methanimicrococcus,1
6,archaea,euryarchaeota,methanomicrobia,methanosarcinales,methermicoccaceae,methermicoccus,1
7,archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,ferroglobus,1
8,archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,geoglobus,1
9,archaea,euryarchaeota,halobacteria,halobacteriales,halobacteriaceae,haloplanus,1


In [215]:
# We reorder the metadata here, this will order the resulting data
metadata = metadata[["barcode", "group_phase", "group_type", "sample"]].sort_values(by=[ 'group_phase', 'group_type', 'sample'])
metadata.head(10)

Unnamed: 0,barcode,group_phase,group_type,sample
0,MID1,,EXTRACTION CONTROL,
6,MID7,1.0,Control,stool
2,MID3,1.0,Control,tissue
5,MID6,1.0,NEC,stool
1,MID2,1.0,NEC,tissue
8,MID9,2.0,Control,stool
4,MID5,2.0,Control,tissue
7,MID8,2.0,NEC,stool
3,MID4,2.0,NEC,tissue


In [216]:
# Loop over all the datasheets
clean_data = pd.DataFrame(columns=scientific_classification)
for metadata_row in metadata.itertuples():
    raw_data = pd.read_excel(MICROBIOME_FOLDER+"/"+metadata_row.barcode+".xls", sheetname='Sheet 1', header=None)
    # Change column names to something clearer
    raw_data.columns = ["raw_classification", "value"]
    
    # For each datasheet create a local classified set of data
    classifier_array = [get_classifiers(row.raw_classification) for row in raw_data.itertuples()]
    local_classified = pd.DataFrame(classifier_array, columns=scientific_classification)
    
    # Add the columns that are not the classification to the local set of data 
    local_classified[str(metadata_row.barcode)]  = test_data["value"]
    # Add the local data to the clean DataFrame
    clean_data = pd.merge(clean_data, local_classified, how="outer", on=scientific_classification)

clean_data.head(10)

Unnamed: 0,domain,phylum,class,order,family,genus,MID1,MID7,MID3,MID6,MID2,MID9,MID5,MID8,MID4
0,archaea,crenarchaeota,thermoprotei,desulfurococcales,desulfurococcaceae,ignisphaera,7.0,2.0,3.0,7.0,3.0,2.0,3.0,,7.0
1,archaea,crenarchaeota,thermoprotei,desulfurococcales,pyrodictiaceae,pyrolobus,2.0,1.0,,3.0,1.0,,1.0,,
2,archaea,crenarchaeota,thermoprotei,sulfolobales,sulfolobaceae,stygiolobus,3.0,1.0,1.0,7.0,1.0,7.0,1.0,7.0,
3,archaea,crenarchaeota,thermoprotei,thermoproteales,thermofilaceae,thermofilum,3.0,1.0,1.0,1.0,1.0,1.0,4.0,,
4,archaea,euryarchaeota,methanomicrobia,methanocellales,methanocellaceae,methanocella,7.0,1.0,2.0,1.0,2.0,1.0,4.0,,3.0
5,archaea,euryarchaeota,methanomicrobia,methanosarcinales,methanosarcinaceae,methanimicrococcus,1.0,2.0,12.0,4.0,1.0,2.0,2.0,,
6,archaea,euryarchaeota,methanomicrobia,methanosarcinales,methermicoccaceae,methermicoccus,1.0,4.0,2.0,2.0,12.0,4.0,1.0,,
7,archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,ferroglobus,1.0,1.0,,4.0,2.0,1.0,1.0,,3.0
8,archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,geoglobus,1.0,12.0,,,,12.0,1.0,,
9,archaea,euryarchaeota,halobacteria,halobacteriales,halobacteriaceae,haloplanus,1.0,2.0,1.0,1.0,1.0,2.0,2.0,,


** Lets find all rows that contains only NaN values **

Those rows are not usefull for the analysis and we will drop them.

In [217]:
# Get a table with True in place of a row where all the MID values are null
table_of_null_row = pd.isnull(clean_data[metadata.barcode]).all(axis=1)

# Get the associated indexes
index_of_null_row = table_of_null_row[table_of_null_row].index[:]

# Lets check that the value are actually null
clean_data.iloc[index_of_null_row]

Unnamed: 0,domain,phylum,class,order,family,genus,MID1,MID7,MID3,MID6,MID2,MID9,MID5,MID8,MID4
419,bacteria,proteobacteria,betaproteobacteria,burkholderiales,comamonadaceae,rhodoferax,,,,,,,,,
420,bacteria,proteobacteria,betaproteobacteria,burkholderiales,comamonadaceae,simplicispira,,,,,,,,,
421,bacteria,proteobacteria,betaproteobacteria,burkholderiales,comamonadaceae,tepidicella,,,,,,,,,
422,bacteria,proteobacteria,betaproteobacteria,burkholderiales,oxalobacteraceae,undibacterium,,,,,,,,,
423,bacteria,proteobacteria,betaproteobacteria,hydrogenophilales,hydrogenophilaceae,tepidiphilus,,,,,,,,,
425,bacteria,proteobacteria,betaproteobacteria,methylophilales,methylophilaceae,methylovorus,,,,,,,,,
426,bacteria,proteobacteria,betaproteobacteria,neisseriales,neisseriaceae,formivibrio,,,,,,,,,
427,bacteria,proteobacteria,betaproteobacteria,neisseriales,neisseriaceae,leeia,,,,,,,,,
428,bacteria,proteobacteria,betaproteobacteria,neisseriales,neisseriaceae,microvirgula,,,,,,,,,
430,bacteria,proteobacteria,betaproteobacteria,neisseriales,neisseriaceae,stenoxybacter,,,,,,,,,


In [218]:
clean_data = clean_data.drop(index_of_null_row)

** Now lets replace all the None values by unknow **

In [219]:
clean_data.fillna(value="unknown", inplace=True)

In [220]:
clean_data

Unnamed: 0,domain,phylum,class,order,family,genus,MID1,MID7,MID3,MID6,MID2,MID9,MID5,MID8,MID4
0,archaea,crenarchaeota,thermoprotei,desulfurococcales,desulfurococcaceae,ignisphaera,7,2,3,7,3,2,3,unknown,7
1,archaea,crenarchaeota,thermoprotei,desulfurococcales,pyrodictiaceae,pyrolobus,2,1,unknown,3,1,unknown,1,unknown,unknown
2,archaea,crenarchaeota,thermoprotei,sulfolobales,sulfolobaceae,stygiolobus,3,1,1,7,1,7,1,7,unknown
3,archaea,crenarchaeota,thermoprotei,thermoproteales,thermofilaceae,thermofilum,3,1,1,1,1,1,4,unknown,unknown
4,archaea,euryarchaeota,methanomicrobia,methanocellales,methanocellaceae,methanocella,7,1,2,1,2,1,4,unknown,3
5,archaea,euryarchaeota,methanomicrobia,methanosarcinales,methanosarcinaceae,methanimicrococcus,1,2,12,4,1,2,2,unknown,unknown
6,archaea,euryarchaeota,methanomicrobia,methanosarcinales,methermicoccaceae,methermicoccus,1,4,2,2,12,4,1,unknown,unknown
7,archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,ferroglobus,1,1,unknown,4,2,1,1,unknown,3
8,archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,geoglobus,1,12,unknown,unknown,unknown,12,1,unknown,unknown
9,archaea,euryarchaeota,halobacteria,halobacteriales,halobacteriaceae,haloplanus,1,2,1,1,1,2,2,unknown,unknown


We see that some genus are unknown... This seems a bit odd, so we will search if one other of the term is the actual genus

In [221]:
unknown_genus = clean_data[clean_data.genus == 'unknown']
unknown_genus

Unnamed: 0,domain,phylum,class,order,family,genus,MID1,MID7,MID3,MID6,MID2,MID9,MID5,MID8,MID4
148,bacteria,proteobacteria,alphaproteobacteria,alphaproteobacteria_incertae_sedis,elioraea,unknown,1,unknown,unknown,unknown,unknown,unknown,unknown,unknown,unknown
223,bacteria,proteobacteria,gammaproteobacteria,gammaproteobacteria_incertae_sedis,gilvimarinus,unknown,1,unknown,unknown,1,unknown,2,unknown,unknown,unknown
224,bacteria,proteobacteria,gammaproteobacteria,gammaproteobacteria_incertae_sedis,solimonas,unknown,2,unknown,unknown,unknown,unknown,162,unknown,unknown,unknown
269,bacteria,cyanobacteria,cyanobacteria,chloroplast,bangiophyceae,unknown,2,unknown,unknown,unknown,unknown,unknown,unknown,196,unknown
270,bacteria,cyanobacteria,cyanobacteria,chloroplast,chlorarachniophyceae,unknown,85,unknown,unknown,unknown,unknown,unknown,unknown,1,1
271,bacteria,cyanobacteria,cyanobacteria,chloroplast,streptophyta,unknown,1388,unknown,unknown,unknown,unknown,unknown,unknown,unknown,1
281,bacteria,actinobacteria,actinobacteria,acidimicrobidae_incertae_sedis,ilumatobacter,unknown,unknown,3,2,unknown,unknown,unknown,1,unknown,unknown
384,bacteria,proteobacteria,alphaproteobacteria,alphaproteobacteria_incertae_sedis,geminicoccus,unknown,unknown,1,unknown,unknown,unknown,unknown,unknown,unknown,unknown
455,bacteria,proteobacteria,gammaproteobacteria,gammaproteobacteria_incertae_sedis,sedimenticola,unknown,unknown,unknown,unknown,unknown,unknown,9,unknown,unknown,unknown
512,bacteria,bacteroidetes,bacteroidetes_incertae_sedis,marinifilum,unknown,unknown,unknown,unknown,1,unknown,unknown,unknown,unknown,7,12


From the list above, we can see that we have many cases where we have a string containing "incertae_sedis". Normally, this nomenclature is used when a genus doesn't have clear parents (https://en.wikipedia.org/wiki/Incertae_sedis). Because of this we will attempt to use the first class after the incertae_sedis as the name of the genus and keep the other values as unknown.

In the case of "(some term)_genera_incertae_sedis", it means that the term is the genus and the rest is unknown, so we will do a special parse for those cases.

In cases where we don't have incertae sedis, we will avoid doing manipulations. We don't want to induce errors in the dataset because we managed poorly the nomenclature for the specific cases. Because there's a small amount of thoses cases left after the analysis, it could be asked to an expert in the domain or obtained with more advanced research on the subject to add a case by case filter

In [237]:
def get_incertae_genus(class_, order):
    """Extracts genus from either class or order if either of them contain incertae_sedis"""
    for target in [class_, order]:
        print(target)
        genus, *rest = target.split('_incertae_sedis')
        if len(rest) > 0:
            return genus
    return 'unknown'

In [239]:
string.split?

Object `string.split` not found.


In [238]:
genuses = [get_incertae_genus(c, o) for *_, c, o in unknown_genus[['class', 'order']].itertuples()]

alphaproteobacteria
alphaproteobacteria_incertae_sedis
gammaproteobacteria
gammaproteobacteria_incertae_sedis
gammaproteobacteria
gammaproteobacteria_incertae_sedis
cyanobacteria
chloroplast
cyanobacteria
chloroplast
cyanobacteria
chloroplast
actinobacteria
acidimicrobidae_incertae_sedis
alphaproteobacteria
alphaproteobacteria_incertae_sedis
gammaproteobacteria
gammaproteobacteria_incertae_sedis
bacteroidetes_incertae_sedis
dehalococcoidetes
dehalogenimonas
acidobacteria_gp16
gp16
bacteroidetes_incertae_sedis


In [224]:
# Insert newly retrieved genus names
clean_data.loc[clean_data.genus == 'unknown', 'genus'] = genuses

In [225]:
# Find the remaining unknown genus, those will not be managed and would be left for later manual management
clean_data[clean_data.genus == 'unknown']

Unnamed: 0,domain,phylum,class,order,family,genus,MID1,MID7,MID3,MID6,MID2,MID9,MID5,MID8,MID4
269,bacteria,cyanobacteria,cyanobacteria,chloroplast,bangiophyceae,unknown,2,unknown,unknown,unknown,unknown,unknown,unknown,196,unknown
270,bacteria,cyanobacteria,cyanobacteria,chloroplast,chlorarachniophyceae,unknown,85,unknown,unknown,unknown,unknown,unknown,unknown,1,1
271,bacteria,cyanobacteria,cyanobacteria,chloroplast,streptophyta,unknown,1388,unknown,unknown,unknown,unknown,unknown,unknown,unknown,1
677,bacteria,chloroflexi,dehalococcoidetes,dehalogenimonas,unknown,unknown,unknown,unknown,unknown,unknown,unknown,6,1,unknown,unknown
712,bacteria,acidobacteria,acidobacteria_gp16,gp16,unknown,unknown,unknown,unknown,unknown,unknown,unknown,unknown,17,unknown,unknown


**PROOFREADING** How to replace all order/classes ending in _incertae_sedis with unknown? 
`clean_data.class.str.replace` but I do not know regexs.

In [261]:
for item in clean_data[["class", "order"]].itertuples():
    print(item)
 #   if item["class"].find("_incertae_sedis") != -1:
#        print(item)

Pandas(Index=0, _1='thermoprotei', order='desulfurococcales')
Pandas(Index=1, _1='thermoprotei', order='desulfurococcales')
Pandas(Index=2, _1='thermoprotei', order='sulfolobales')
Pandas(Index=3, _1='thermoprotei', order='thermoproteales')
Pandas(Index=4, _1='methanomicrobia', order='methanocellales')
Pandas(Index=5, _1='methanomicrobia', order='methanosarcinales')
Pandas(Index=6, _1='methanomicrobia', order='methanosarcinales')
Pandas(Index=7, _1='archaeoglobi', order='archaeoglobales')
Pandas(Index=8, _1='archaeoglobi', order='archaeoglobales')
Pandas(Index=9, _1='halobacteria', order='halobacteriales')
Pandas(Index=10, _1='halobacteria', order='halobacteriales')
Pandas(Index=11, _1='halobacteria', order='halobacteriales')
Pandas(Index=12, _1='halobacteria', order='halobacteriales')
Pandas(Index=13, _1='halobacteria', order='halobacteriales')
Pandas(Index=14, _1='methanococci', order='methanococcales')
Pandas(Index=15, _1='methanopyri', order='methanopyrales')
Pandas(Index=16, _1='t

** Now lets manage the index of the DataFrame **

For now, we consider that every column but the value can be considered as a metadata. 

In [227]:
indexed_data = clean_data.set_index(list(scientific_classification))
indexed_data.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,MID1,MID7,MID3,MID6,MID2,MID9,MID5,MID8,MID4
domain,phylum,class,order,family,genus,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
archaea,crenarchaeota,thermoprotei,desulfurococcales,desulfurococcaceae,ignisphaera,7,2,3,7,3,2,3,unknown,7
archaea,crenarchaeota,thermoprotei,desulfurococcales,pyrodictiaceae,pyrolobus,2,1,unknown,3,1,unknown,1,unknown,unknown
archaea,crenarchaeota,thermoprotei,sulfolobales,sulfolobaceae,stygiolobus,3,1,1,7,1,7,1,7,unknown
archaea,crenarchaeota,thermoprotei,thermoproteales,thermofilaceae,thermofilum,3,1,1,1,1,1,4,unknown,unknown
archaea,euryarchaeota,methanomicrobia,methanocellales,methanocellaceae,methanocella,7,1,2,1,2,1,4,unknown,3


In [228]:
indexed_data.index.is_unique

True

** Now lets give the column to more meaningfull names **

The ordering of the columns is already managed to give a nice output. This is why we reordered the metadata earlier on.

In [229]:
pretty_data = pd.DataFrame(data=indexed_data.values,
                           index=indexed_data.index,
                           columns=[metadata.group_phase.get_values(),
                                    metadata.group_type.get_values(),
                                    metadata["sample"].get_values()])
pretty_data.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,1,1,1,1,2,2,2,2
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,EXTRACTION CONTROL,Control,Control,NEC,NEC,Control,Control,NEC,NEC
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,stool,tissue,stool,tissue,stool,tissue,stool,tissue
domain,phylum,class,order,family,genus,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3
archaea,crenarchaeota,thermoprotei,desulfurococcales,desulfurococcaceae,ignisphaera,7,2,3,7,3,2,3,unknown,7
archaea,crenarchaeota,thermoprotei,desulfurococcales,pyrodictiaceae,pyrolobus,2,1,unknown,3,1,unknown,1,unknown,unknown
archaea,crenarchaeota,thermoprotei,sulfolobales,sulfolobaceae,stygiolobus,3,1,1,7,1,7,1,7,unknown
archaea,crenarchaeota,thermoprotei,thermoproteales,thermofilaceae,thermofilum,3,1,1,1,1,1,4,unknown,unknown
archaea,euryarchaeota,methanomicrobia,methanocellales,methanocellaceae,methanocella,7,1,2,1,2,1,4,unknown,3


In [230]:
#Lets give a name to the columns
pretty_data.columns.names = ["Group Number", "Group Type", "Sample"]
pretty_data.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Group Number,Unnamed: 6_level_0,1,1,1,1,2,2,2,2
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Group Type,EXTRACTION CONTROL,Control,Control,NEC,NEC,Control,Control,NEC,NEC
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Sample,Unnamed: 6_level_2,stool,tissue,stool,tissue,stool,tissue,stool,tissue
domain,phylum,class,order,family,genus,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3
archaea,crenarchaeota,thermoprotei,desulfurococcales,desulfurococcaceae,ignisphaera,7,2,3,7,3,2,3,unknown,7
archaea,crenarchaeota,thermoprotei,desulfurococcales,pyrodictiaceae,pyrolobus,2,1,unknown,3,1,unknown,1,unknown,unknown
archaea,crenarchaeota,thermoprotei,sulfolobales,sulfolobaceae,stygiolobus,3,1,1,7,1,7,1,7,unknown
archaea,crenarchaeota,thermoprotei,thermoproteales,thermofilaceae,thermofilum,3,1,1,1,1,1,4,unknown,unknown
archaea,euryarchaeota,methanomicrobia,methanocellales,methanocellaceae,methanocella,7,1,2,1,2,1,4,unknown,3
archaea,euryarchaeota,methanomicrobia,methanosarcinales,methanosarcinaceae,methanimicrococcus,1,2,12,4,1,2,2,unknown,unknown
archaea,euryarchaeota,methanomicrobia,methanosarcinales,methermicoccaceae,methermicoccus,1,4,2,2,12,4,1,unknown,unknown
archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,ferroglobus,1,1,unknown,4,2,1,1,unknown,3
archaea,euryarchaeota,archaeoglobi,archaeoglobales,archaeoglobaceae,geoglobus,1,12,unknown,unknown,unknown,12,1,unknown,unknown
archaea,euryarchaeota,halobacteria,halobacteriales,halobacteriaceae,haloplanus,1,2,1,1,1,2,2,unknown,unknown


In [231]:
# Lets export the resulting data for easy referencing
pretty_data.to_csv("resulting_data.csv")

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [232]:
from IPython.core.display import HTML
#HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

In [233]:
# Write your answer here