# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from dateutil import parser

In [2]:
DATA_FOLDER = '../../ADA2017-Tutorials/02 - Intro to Pandas/Data/' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

## Explanation
A quick look at the dataset reveals several inconsistencies, so the first step is to clean and transform the data. We identified the following problems:
- Column names differ across countries. For example, the Guinea dataset contains the columns "Date,Description,Totals", whereas Liberia contains "Date,Variable,National" (the columns have the same meaning).
- The CSV files have a variable number of columns. Fortunately, this is not an issue in our case, since we are only interested in totals.
- There are many missing values. These will be treated as 0.
- Dates are given in different formats, such as '2014-08-04' or '6/16/2014'.
- Some records contain newlines. This issue is automatically handled by Pandas.
- Row descriptions are inconsistent and are provided under different names. For instance, in Guinea we can find the following variables:
    - New cases of suspects
    - New cases of probables
    - New cases of confirmed
    - Total new cases registered so far *(the sum of the previous ones)*
    
    Conversely, in Liberia we can observe:
    - New Case/s (Suspected)
    - New Case/s (Probable)
    - New case/s (confirmed) *(notice the difference between "Case" and "case")*
    
    There are other examples of these irregularities. They will be matched together.
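
The date and label irregularities listed above can be handled with a couple of small helpers. The following is only a sketch using the standard library (the notebook itself relies on `dateutil`); the helper names `parse_report_date` and `normalise_label` are hypothetical, and it assumes that only the two date layouts mentioned above occur.

```python
from datetime import datetime

# Assumption: only these two layouts occur; extend the tuple if others show up.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_report_date(text):
    """Return a datetime for a date string written in one of KNOWN_FORMATS."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def normalise_label(text):
    """Lower-case a row description and collapse whitespace (including newlines)."""
    return " ".join(text.lower().split())


print(parse_report_date("2014-08-04"))
print(parse_report_date("6/16/2014"))
print(normalise_label("New Case/s (Suspected)"))
```

After normalisation, "New Case/s" and "New case/s" compare equal, so a single lookup table can map every variant to a standard name.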
    
### Plan
The task will be subdivided into three parts:
1. Dataset cleaning. This subtask will produce a consistent, uniform dataset.
2. Pivoting: "new cases" and "new deaths" will be added as new columns.
3. Aggregation: it will return the required daily averages per month.
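
To make the pivoting and aggregation steps concrete, here is a minimal sketch on a handful of invented records that are assumed to be already cleaned into a long format. The values are made up for illustration and are not taken from the reports.

```python
import pandas as pd

# Invented toy records in the cleaned long format (not taken from the real reports).
records = pd.DataFrame({
    "date": pd.to_datetime(["2014-08-01", "2014-08-01", "2014-08-01",
                            "2014-08-02", "2014-08-02"]),
    "country": ["Guinea"] * 5,
    "type": ["new_cases", "new_cases", "new_deaths", "new_cases", "new_deaths"],
    "totals": [5.0, 4.0, 2.0, 3.0, 4.0],
})

# Pivoting: one row per (date, country), one column per type; duplicates are summed.
daily = records.pivot_table(index=["date", "country"], columns="type",
                            values="totals", aggfunc="sum").reset_index()

# Aggregation: mean of the daily values within each country/year/month.
monthly = (daily
           .assign(year=daily["date"].dt.year, month=daily["date"].dt.month)
           .groupby(["country", "year", "month"])[["new_cases", "new_deaths"]]
           .mean())
print(monthly)
```

In this toy example the two `new_cases` rows of the first day are summed to 9 before the monthly mean is taken, which mirrors the suspected + probable + confirmed convention used in this notebook.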

### Remarks
Since it was not specified, we made an assumption: by "new cases" we mean the sum of suspected, probable, and confirmed cases.

In [3]:
# Load the dataset
country_paths = glob.glob(DATA_FOLDER + 'ebola/*/')
country_names = [os.path.basename(os.path.normpath(x)).replace('_data', '').capitalize() for x in country_paths]

In [4]:
# Country names are derived from the directory name (N.B.: Sl = Sierra Leone)
print(country_names)

['Guinea', 'Liberia', 'Sl']


In [5]:
# As explained earlier, we need to map different aliases to a standard column name
aliases = {
    'variable': 'description',
    'national': 'totals'
}

In [6]:
dfs = []
for i, path in enumerate(country_paths):
    filenames = glob.glob(path + "/*.csv")
    for f in filenames:
        df = pd.read_csv(f)
        df.columns = [c.lower() for c in df.columns] # Set all columns to lower-case
        df.rename(columns=aliases, inplace=True)
        df['country'] = country_names[i]
        df['date'] = [parser.parse(x) for x in df['date']] # Parse date (dateutil reads 'M/D/YYYY' month-first by default)
        df['totals'] = pd.to_numeric(df['totals'], errors='coerce')
        df = df.fillna(0) # Fill empty values with zeros
        dfs.append(df[['date', 'country', 'description', 'totals']]) # Extract only relevant columns
        
df = pd.concat(dfs)
df['description'] = [x.lower() for x in df['description']]
df.head()

Unnamed: 0,date,country,description,totals
0,2014-08-04,Guinea,new cases of suspects,5.0
1,2014-08-04,Guinea,new cases of probables,0.0
2,2014-08-04,Guinea,new cases of confirmed,4.0
3,2014-08-04,Guinea,total new cases registered so far,9.0
4,2014-08-04,Guinea,total cases of suspects,11.0


In [7]:
# Now we extract the relevant variables and we map them to a standard name (new_cases and new_deaths)
column_keywords = {
    'new_cases': ['new cases of suspects', 'new cases of probables', 'new cases of confirmed', 'new case/s (suspected)',
                 'new case/s (probable)', 'new case/s (confirmed)',  'new_suspected', 'new_probable', 'new_confirmed'],
    'new_deaths': ['new deaths registered today', 'new deaths registered', 'newly reported deaths',
                  'etc_new_deaths']
}

desc_map = []
for k, v in column_keywords.items():
    for alias in v:
        desc_map.append((k, alias))
desc_map = pd.DataFrame(desc_map, columns=['type', 'description'])
df = df.merge(desc_map, on='description', how='inner')
df = df.drop('description', axis=1)

In [8]:
# This is much better
df.head(20)

Unnamed: 0,date,country,totals,type
0,2014-08-04,Guinea,5.0,new_cases
1,2014-08-26,Guinea,18.0,new_cases
2,2014-08-27,Guinea,12.0,new_cases
3,2014-08-30,Guinea,15.0,new_cases
4,2014-08-31,Guinea,9.0,new_cases
5,2014-09-02,Guinea,11.0,new_cases
6,2014-09-04,Guinea,13.0,new_cases
7,2014-09-07,Guinea,5.0,new_cases
8,2014-09-08,Guinea,5.0,new_cases
9,2014-09-09,Guinea,9.0,new_cases


In [9]:
# The dataframe is pivoted so as to transform new_cases and new_deaths types to columns
# Furthermore, equivalent types are aggregated with a sum (e.g. suspected + probable + confirmed)
pivoted = df.pivot_table(index=['date', 'country'], columns='type', values='totals', aggfunc='sum')
pivoted.columns.name = None
df = pivoted.reset_index()

In [10]:
# Some records from Liberia
df[df.country == 'Liberia'].head()

Unnamed: 0,date,country,new_cases,new_deaths
0,2014-06-16,Liberia,4.0,2.0
1,2014-06-17,Liberia,2.0,0.0
2,2014-06-22,Liberia,10.0,4.0
3,2014-06-24,Liberia,6.0,4.0
4,2014-06-25,Liberia,7.0,3.0


In [11]:
# Group by country, year, month, and aggregate with the mean value. This produces the daily average per country/year/month.
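# The value columns are selected explicitly so that only new_cases and new_deaths are averaged.
df_ = df.groupby([df.country, df.date.dt.year, df.date.dt.month])[['new_cases', 'new_deaths']].mean()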
df_.index.names = ['country', 'year', 'month']
df_.reset_index().sort_values(ascending=[True, True, True], by=['country', 'year', 'month'])

Unnamed: 0,country,year,month,new_cases,new_deaths
0,Guinea,2014,8,25.8,3.4
1,Guinea,2014,9,19.625,3.5625
2,Guinea,2014,10,34.0,15.0
3,Liberia,2014,6,5.714286,2.0
4,Liberia,2014,7,8.545455,4.272727
5,Liberia,2014,8,37.222222,23.222222
6,Liberia,2014,9,63.833333,36.041667
7,Liberia,2014,10,45.56,28.04
8,Liberia,2014,11,26.466667,13.466667
9,Liberia,2014,12,5178.555556,0.0


### Final observations
The number of new cases in Liberia in December 2014 (a daily average of about 5,179) is implausibly high compared with the earlier months. However, a manual inspection showed that these values appear as such in the original dataset.
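
One way to locate such values before inspecting the source files by hand is to flag days that are far above a country's typical level. The sketch below uses invented numbers, and the "more than 10 times the median" threshold is an arbitrary rule of thumb chosen for illustration, not something derived from the dataset.

```python
import pandas as pd

# Invented daily values; the last one mimics a suspiciously large figure.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2014-11-29", "2014-11-30", "2014-12-01", "2014-12-02"]),
    "country": ["Liberia"] * 4,
    "new_cases": [20.0, 30.0, 25.0, 5000.0],
})

# Hypothetical rule of thumb: flag days more than 10x the country's median.
median = daily.groupby("country")["new_cases"].transform("median")
suspicious = daily[daily["new_cases"] > 10 * median]
print(suspicious)
```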

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
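
As an illustration of the requested steps, the sketch below combines two small in-memory tables with a metadata table. All values are invented stand-ins for the spreadsheets, and the left join is just one possible choice: it keeps count rows even when a barcode has no metadata.

```python
import pandas as pd

# Invented stand-ins for two count spreadsheets and the metadata sheet.
counts = pd.concat([
    pd.DataFrame({"TAXON": ["A", "B"], "COUNT": [7, 2], "BARCODE": "MID1"}),
    pd.DataFrame({"TAXON": ["A", "C"], "COUNT": [1, 5], "BARCODE": "MID2"}),
], ignore_index=True)
metadata = pd.DataFrame({
    "BARCODE": ["MID1", "MID2"],
    "GROUP": ["EXTRACTION CONTROL", "NEC 1"],
    "SAMPLE": [None, "tissue"],
})

# Attach the metadata, replace missing values, and index by (TAXON, BARCODE).
combined = (counts.merge(metadata, on="BARCODE", how="left")
                  .fillna("unknown")
                  .set_index(["TAXON", "BARCODE"]))
print(combined.index.is_unique)
print(combined)
```

The same taxon can appear in several samples, so `TAXON` alone is not unique; pairing it with `BARCODE` is what makes the index unique here.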

In [12]:
path = DATA_FOLDER + "microbiome"
metadata = glob.glob(path + "/metadata.xls")
filenames = glob.glob(path + "/MID*.xls")

rnaData = []
for file in filenames:
    currData = pd.read_excel(file, header=None, names=["TAXON", "COUNT"])
    # Take the barcode from the file name (e.g. "MID1.xls" -> "MID1"); glob does not guarantee the file order
    currData["BARCODE"] = os.path.splitext(os.path.basename(file))[0]
    rnaData.append(currData)

result = pd.concat(rnaData)
result.head()

Unnamed: 0,TAXON,COUNT,BARCODE
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7,MID1
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2,MID1
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3,MID1
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3,MID1
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7,MID1


In [13]:
# Perform an inner join on BARCODE, so as to match the metadata with the actual content
metadata = pd.read_excel(metadata[0])
result = pd.merge(result, metadata, how="inner", on="BARCODE")
result.head()

Unnamed: 0,TAXON,COUNT,BARCODE,GROUP,SAMPLE
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7,MID1,EXTRACTION CONTROL,
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2,MID1,EXTRACTION CONTROL,
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3,MID1,EXTRACTION CONTROL,
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3,MID1,EXTRACTION CONTROL,
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7,MID1,EXTRACTION CONTROL,


In [14]:
# Replace missing values (NaNs) with "unknown"
result = result.fillna("unknown")
# Make sure that there are no NaNs
print(pd.isnull(result).sum()) 

TAXON      0
COUNT      0
BARCODE    0
GROUP      0
SAMPLE     0
dtype: int64


In [15]:
# Set an index that is unique (the combination of TAXON and BARCODE is suitable)
result.set_index(["TAXON","BARCODE"], inplace=True)
print(result.index.is_unique) # Check uniqueness

# Show a sample of the transformed dataset
result.sample(10)

True


Unnamed: 0_level_0,Unnamed: 1_level_0,COUNT,GROUP,SAMPLE
TAXON,BARCODE,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Bacteria ""Firmicutes"" ""Clostridia"" Clostridiales Veillonellaceae Schwartzia",MID3,1,Control 1,tissue
"Bacteria ""Actinobacteria"" Actinobacteria Actinomycetales Promicromonosporaceae Xylanibacterium",MID1,2,EXTRACTION CONTROL,unknown
"Bacteria ""Proteobacteria"" Gammaproteobacteria ""Enterobacteriales"" Enterobacteriaceae Samsonia",MID5,237,Control 2,tissue
"Bacteria ""Verrucomicrobia"" Opitutae Puniceicoccales Puniceicoccaceae Cerasicoccus",MID7,39,Control 1,stool
"Bacteria ""Bacteroidetes"" ""Bacteroidia"" ""Bacteroidales"" ""Rikenellaceae"" Rikenella",MID7,1,Control 1,stool
"Bacteria ""Bacteroidetes"" Flavobacteria ""Flavobacteriales"" Flavobacteriaceae Nonlabens",MID4,1,NEC 2,tissue
"Bacteria ""Actinobacteria"" Actinobacteria Actinomycetales Microbacteriaceae Microbacterium",MID2,1,NEC 1,tissue
"Bacteria ""Firmicutes"" ""Clostridia"" Clostridiales Veillonellaceae Propionispira",MID8,6,NEC 2,stool
"Bacteria ""Bacteroidetes"" Flavobacteria ""Flavobacteriales"" Flavobacteriaceae Zeaxanthinibacter",MID2,1,NEC 1,tissue
"Bacteria ""Firmicutes"" ""Bacilli"" Bacillales Bacillaceae Thalassobacillus",MID8,1,NEC 2,stool


## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
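
For questions 2 and 6, the binning could be sketched as follows. This is not a solution to the task: the ages are invented rather than read from `titanic.xls`, and `pd.cut` and `pd.qcut` are only one way to build the intervals.

```python
import pandas as pd

# Invented ages, not taken from titanic.xls.
ages = pd.Series([2, 15, 22, 28, 35, 41, 58, 67], name="age")

# Decade intervals [0, 10), [10, 20), ...; right=False makes each bin left-closed.
decades = pd.cut(ages, bins=range(0, 81, 10), right=False)

# Two equally populated categories: qcut splits at the median.
halves = pd.qcut(ages, q=2, labels=["younger", "older"])

print(decades.value_counts(sort=False))
print(halves.value_counts())
```

Note that when many passengers share the median age, the two `qcut` groups may not be exactly the same size.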

In [17]:
# Write your answer here