# DOPP 2018W Exercise 2

Before we get started, please set the variables below to your student ID and name:

In [None]:
studentID = '00000000'
name = 'Your Name'

This is the template for the second exercise in Data-oriented Programming Paradigms (2018W). 
Before you get started, please read the instructions in this notebook carefully.

### Preliminaries:
- In order to get a valid score, you must rename this file from `exercise_2.ipynb` to `"%s_exercise_2.ipynb" % studentID`, i.e. your student ID followed by `_exercise_2.ipynb`.

- Please use only Python version 3 (3.6+ recommended). It is recommended to install Anaconda or Miniconda. 

- Most of the code in this notebook will be scored using unit tests. 
- Please use the code stubs provided, do not rename any functions, and add and modify your code only at the provided markers. 
- Make sure that your submission executes without any errors before submitting it.
- The submission will be executed on a Unix system. If you use Windows, please use the functionality provided in the `os` module so that your path names also work on Unix.
- For the submission, only this (renamed) notebook file needs to be uploaded to TUWEL. The data will be available under the same relative path in the directory where we place your notebook for grading.


The submission deadline is **12.12.2018 23:55.**

In [15]:
# Note: The only imports allowed are those contained in Python's standard library, pandas, numpy, scipy, matplotlib and scikit-learn (sklearn)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#import sklearn...

## Goal
In this exercise, you will 
 * use `pandas` to read, prepare and transform data,
 * use `matplotlib` to visually analyse data,
 * use `scikit-learn` to build prediction models.


The goal of this exercise is to model the relationship between weather observations and the prevalence of new influenza infections.

To investigate a potential relationship, we will use two datasets:
 * daily weather observation data in Vienna (2012-2018)
 * weekly reports on [new influenza infections](https://www.data.gv.at/katalog/dataset/grippemeldedienst-stadt-wien) in Vienna (2009-2018).

Note that the weather data set differs from the one used in exercise 1, so be sure to use the one provided for exercise 2. The data to be used can be found in the subdirectory named `data`. 
If you develop your submission on Windows, please make sure that you don't use any backslashes in the file names; otherwise the submission won't run on Unix systems. 
Either use normal slashes, or use the functions provided in the `os.path` module. 
If you stick with the provided function templates, you should be fine.
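
As a minimal sketch of the advice above, the snippet below builds a path with `os.path.join` instead of hard-coding a separator. The file name `weather_2012.csv` is a hypothetical placeholder and not necessarily a name used in `data`.

```python
import os

# Hypothetical file name, used for illustration only
DATA_DIR = 'data'
csv_path = os.path.join(DATA_DIR, 'weather_2012.csv')

# os.path.join inserts the separator of the platform the code runs on,
# so the same line works on Windows and on the Unix grading system
print(csv_path)
```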

To complete this exercise, you will have to:
* prepare the data, which (at a minimum) involves the following:
    - handling missing values,
    - handling outliers,
    - temporal alignment (i.e. convert daily data to weekly data using appropriate aggregation functions),
* analyse the data:
    - compare descriptive statistics,
    - visually investigate the raw data to gain an understanding of the data, identify patterns, outliers etc.,
    - look at the relationship between the variables of interest,
* model the relationship:
    - fit a model that predicts new infections from weather observation data.

## Task 1: Load Data

### Weather observations <span style="color:blue">(1 P)</span>

As a first step, implement the function `load_weather_data()`, which should read all individual (yearly) data sets from the CSV files in `data` into a single `pd.DataFrame` and return it. 

- add a column for the year
- add a `week` column containing the week number (use Pandas built-in datetime handling features to get the week number for each given date)
- create a `MultiIndex` from the date columns with the following hierarchy: `year` - `month` - `week` - `day` (make sure to label them accordingly)
- make sure that all columns are appropriately typed
- make sure that you load all the data (2012-2018)

**Hints:**
 - It is advisable not to append each data set individually, but to read each data frame, store it in a list and combine them once at the end.
 - Your resulting data frame should look as follows:
 
![Weather data frame example](weather_dataFrame_example.png)
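
The read-collect-concatenate pattern from the hints could look roughly like the sketch below. It works on two tiny in-memory stand-ins rather than the real files, and the column names `date` and `temp_dailyMean` as well as the file layout are assumptions made for illustration. `Series.dt.isocalendar()` requires pandas 1.1 or newer; older versions expose `Series.dt.week` instead.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the yearly CSV files.
# Column names and layout are assumptions for illustration only.
fake_files = [
    io.StringIO("date,temp_dailyMean\n2012-01-01,1.5\n2012-01-02,2.0\n"),
    io.StringIO("date,temp_dailyMean\n2013-01-01,-0.5\n2013-01-02,0.0\n"),
]

# Read every file into its own frame, collect them in a list, concatenate once
frames = [pd.read_csv(f, parse_dates=['date']) for f in fake_files]
weather = pd.concat(frames, ignore_index=True)

# Derive the index levels from the parsed dates
weather['year'] = weather['date'].dt.year
weather['month'] = weather['date'].dt.month
weather['week'] = weather['date'].dt.isocalendar().week.astype(int)
weather['day'] = weather['date'].dt.day

weather = (weather
           .drop(columns='date')
           .set_index(['year', 'month', 'week', 'day'])
           .sort_index())
print(weather)
```

Note that ISO week numbers can belong to the neighbouring year: 1 January 2012 was a Sunday and therefore falls into ISO week 52, which is worth keeping in mind for the later weekly aggregation.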

In [None]:
def load_weather_data():
    """ 
    Load all weather data files and combine them into a single Pandas DataFrame.
    Add a week column and a hierarchical index (year, month, week, day)
    
    Returns
    --------
    weather_data: data frame containing the weather data
    """
    # TODO: your changes here
    weather_data = pd.DataFrame()
    return weather_data

data_weather = load_weather_data()

### Influenza infections <span style="color:blue">(1 P)</span>

Load and prepare the second data set, which contains the number of new influenza infections on a weekly basis, as follows:

- get rid of all columns except `Neuerkrankungen pro Woche`, `Jahr`, and `Kalenderwoche`
- rename `Neuerkrankungen pro Woche` to `weekly_infections`
- create a `MultiIndex` from the `Jahr` (→ `year`) and `Kalenderwoche` (→ `week`) columns
- make sure that all columns are appropriately typed
- your resulting data frame should look as follows:


![Example data frame](influenza_dataFrame_example.png)
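
The column selection, renaming and indexing steps listed above can be sketched on a toy frame as follows. The values and the extra column `Quelle` are made up; only the three column names mentioned in the task are taken from the text.

```python
import pandas as pd

# Toy stand-in for the influenza file; values and the column 'Quelle' are made up
raw = pd.DataFrame({
    'Jahr': [2009, 2009, 2010],
    'Kalenderwoche': [40, 41, 1],
    'Neuerkrankungen pro Woche': [1200, 1500, 9800],
    'Quelle': ['x', 'x', 'x'],
})

influenza = (raw[['Jahr', 'Kalenderwoche', 'Neuerkrankungen pro Woche']]
             .rename(columns={'Jahr': 'year',
                              'Kalenderwoche': 'week',
                              'Neuerkrankungen pro Woche': 'weekly_infections'})
             # enforce integer types before building the index
             .astype({'year': int, 'week': int, 'weekly_infections': int})
             .set_index(['year', 'week']))
print(influenza)
```

In the real file the week and infection columns may need additional cleaning before they can be cast to integers.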

In [None]:
def load_influenza_data():
    """ 
    Load and prepare the influenza data file
    
    Returns
    --------
    influenza_data: data frame containing the influenza data
    """
    # TODO: your changes here
    influenza_data = pd.DataFrame()
    
    return influenza_data

data_influenza = load_influenza_data()

## Task 2: Handling Missing values <span style="color:blue">(4 P)</span>

If you take a closer look at the data, you will notice that a few of the observations are missing.

There is a wide range of standard strategies to deal with such missing values, including:

- row deletion
- substitution methods (e.g., replace with mean or median)
- hot-/cold-deck methods (impute from a randomly selected similar record)
- regression methods
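
To make the first two strategies concrete, here is a small sketch on a made-up series. It also shows linear interpolation, which is a substitution variant that pandas offers out of the box; the series name and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up readings with one missing value
temps = pd.Series([10.0, np.nan, 14.0, 30.0], name='temp_dailyMean')

dropped = temps.dropna()                       # row deletion
median_filled = temps.fillna(temps.median())   # substitution with the median
interpolated = temps.interpolate()             # linear interpolation between neighbours

print(pd.DataFrame({'median': median_filled, 'interpolated': interpolated}))
```

The median ignores the temporal order of the readings, whereas interpolation uses the neighbouring days, which can matter for a time series such as daily temperature.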

To decide which strategy is appropriate, it is important to investigate the mechanism that led to the missing values to find out whether the missing data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). 

 - **MCAR** means that there is no relationship between the missingness of the data and any of the values.
 - **MAR** means that there is a systematic relationship between the propensity of a value to be missing and the observed data, but not the missing data itself.
 - **MNAR** means that there is a systematic relationship between the propensity of a value to be missing and the value itself. 

To find out more about what mechanisms may have caused the missing values, you talked to the meteorologist who compiled the data. 
She told you that she does not know why some of the temperature readings are missing, but that it may be that someone forgot to record them. In any case, it is likely that the propensity of a temperature value to be missing does not have anything to do with the weather itself.

As far as the missing humidity readings are concerned, she says that according to her experience, she suspects that the humidity sensor is less reliable in hot weather.

The missing wind speed and direction sensor readings were due to a hardware defect.

Check the plausibility of these hypotheses in the data, consider the implications, and devise an appropriate strategy to deal with the various missing values.
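
One possible way to check such a hypothesis is to compare an observed variable between rows where another variable is missing and rows where it is present. The sketch below does this on synthetic data in which humidity is deliberately made to go missing more often on hot days; the column names follow the naming used later in this notebook, but the data and the missingness rule are invented.

```python
import numpy as np
import pandas as pd

# Synthetic data: humidity is dropped more often on hot days (built in on purpose)
rng = np.random.default_rng(0)
temp = rng.normal(loc=12, scale=9, size=500)
hum = rng.uniform(40, 90, size=500)
hum[(temp > 25) & (rng.random(500) < 0.6)] = np.nan
df = pd.DataFrame({'temp_dailyMean': temp, 'hum_dailyMean': hum})

# Compare the observed temperature on days with and without a humidity reading
hum_missing = df['hum_dailyMean'].isna()
print(df.groupby(hum_missing)['temp_dailyMean'].describe())
```

A clear difference between the two groups suggests that the missingness depends on observed data (MAR) rather than being completely random.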

To implement your strategy, you can use a range of standard mechanisms provided by Pandas or implement a custom strategy (for extra points).

**Chocolate challenge:**
More advanced strategies (e.g., building a machine learning model that predicts MAR data based on other observed values) may give you better results if the propensity of data to be missing is of a systematic nature. If you want to participate in the chocolate challenge for the best approximation of missing values, you can implement your missing value prediction model in `handle_missingValues_advanced` below. 

In [None]:
def handle_missingValues_simple(incomplete_data):
    """ 
    Parameters
    --------
    incomplete_data: data frame containing missing values 
    
    Returns
    --------
    complete_data: data frame not containing any missing values
    """
    # TODO: your changes here
    complete_data = incomplete_data.fillna(0)
    
    return complete_data


def handle_missingValues_advanced(incomplete_data):
    """ 
    Parameters
    --------
    incomplete_data: data frame containing missing values 
    
    Returns
    --------
    complete_data: data frame not containing any missing values
    """
    # TODO: your changes here
    # Placeholder so that the function runs before it is implemented
    complete_data = incomplete_data
    
    return complete_data
    
data_weather_complete = handle_missingValues_simple(data_weather)

###  Discussion

#### Pros and Cons of strategies for dealing with missing data <span style="color:blue">(1 P)</span>

In the cell provided below, discuss the PROs and CONs of various strategies (row deletion, imputation, hot deck methods etc.) for dealing with missing data. Discuss when it is appropriate to use each method.

[T2_Pros_and_Cons]

<span style="color:blue">**TODO:**</span> Please remove this text, but keep the marker above and answer here...

[/T2_Pros_and_Cons]

#### Your chosen strategy <span style="color:blue">(1 P)</span>

Explain your chosen strategy for dealing with missing values for the various attributes in the cell below.


[T2_MissingValueStrategy]

<span style="color:blue">**TODO:**</span> Please remove this text, but keep the marker above and answer here...

[/T2_MissingValueStrategy]

## Task 3: Handling Outliers <span style="color:blue">(2 P)</span>

If you take a closer look at some of the observations, you should notice that some of the temperature values are not particularly plausible (hint: plotting histograms of the distributions helps). Hypothesize on the nature of these outliers and implement a strategy to handle them.
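
As one possible starting point, the sketch below flags implausible values with the common 1.5 × IQR rule on a made-up series and masks them as missing. The rule and its threshold are assumptions for illustration, not the prescribed solution.

```python
import pandas as pd

# Made-up temperature readings with one implausible value
temps = pd.Series([3.1, 4.0, 2.5, 5.2, 400.0, 3.8, 1.0, 4.4])

# Flag values outside 1.5 * IQR around the quartiles
q1, q3 = temps.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)

# Mark flagged readings as missing so they can be handled like other missing values
cleaned = temps.mask(is_outlier)
print(cleaned)
```

Whether masking, correcting or clipping is appropriate depends on what you think caused the implausible values.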

In [None]:
def handle_outliers(noisy_data):
    """ 
    Parameters
    --------
    noisy_data: data frame that contains outliers
    
    Returns
    --------
    cleaned_data: data frame without outliers
    """
    # TODO: your changes here
    cleaned_data = noisy_data
    return cleaned_data
    
data_weather_cleaned = handle_outliers(data_weather_complete)

#### Your chosen strategy <span style="color:blue">(1 P)</span>

Explain your chosen strategy for dealing with outliers in the cell below.


[T3_OutlierStrategy]

<span style="color:blue">**TODO:**</span> Please remove this text, but keep the marker above and answer here...

[/T3_OutlierStrategy]

## Task 4: Aggregate values <span style="color:blue">(1 P)</span>

Aggregate the observations on a weekly basis. Return a data frame with a hierarchical index (levels `year` and `week`) on the vertical axis and the following weekly aggregations as columns:

- `temp_weeklyMin`: minimum of `temp_dailyMin`
- `temp_weeklyMax`: mean of `temp_dailyMax`
- `temp_weeklyMean`: mean of `temp_dailyMean`
- `temp_7h_weeklyMedian`: median of `temp_7h`
- `temp_14h_weeklyMedian`: median of `temp_14h`
- `temp_19h_weeklyMedian`: median of `temp_19h`

- `hum_weeklyMean`: mean of `hum_dailyMean`
- `hum_7h_weeklyMedian`: median of `hum_7h`
- `hum_14h_weeklyMedian`: median of `hum_14h`
- `hum_19h_weeklyMedian`: median of `hum_19h`

- `precip_weeklyMean`: mean of `precip`
- `wind_mSec_mean`: mean of `wind_mSec`
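
Aggregations like these can be expressed with named aggregation on a `groupby` (available since pandas 0.25). The sketch below uses a made-up daily frame with only two of the columns, indexed in the same way as the weather data.

```python
import pandas as pd

# Made-up daily observations with the (year, month, week, day) index
idx = pd.MultiIndex.from_tuples(
    [(2012, 1, 1, 2), (2012, 1, 1, 3), (2012, 1, 2, 9), (2012, 1, 2, 10)],
    names=['year', 'month', 'week', 'day'])
daily = pd.DataFrame({'temp_dailyMin': [-3.0, -1.0, 0.5, 2.0],
                      'precip': [0.0, 4.0, 1.0, 1.0]}, index=idx)

# Group by year and week; each keyword names an output column
# and maps it to (input column, aggregation function)
weekly = daily.groupby(level=['year', 'week']).agg(
    temp_weeklyMin=('temp_dailyMin', 'min'),
    precip_weeklyMean=('precip', 'mean'),
)
print(weekly)
```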

In [None]:
def aggregate_weekly(data):
    """ 
    Parameters
    --------
    data: weather data frame
    
    Returns
    --------
    weekly_weather_data: data frame that contains statistics aggregated on a weekly basis
    """
    # TODO: your changes here
    weekly_weather_data = pd.DataFrame()    

    
    return weekly_weather_data

data_weather_weekly = aggregate_weekly(data_weather_cleaned)

## Task 5: Merge influenza and weather datasets <span style="color:blue">(1 P)</span>

Merge the `data_weather_weekly` and `data_influenza` datasets.
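
Because both frames share the `year` and `week` index levels, one option is an index-based join. The sketch below uses made-up values and an inner join, which keeps only the weeks present in both frames; whether inner is the right choice for the real data is a decision left to you.

```python
import pandas as pd

# Made-up weekly frames sharing a (year, week) index
weather_weekly = pd.DataFrame(
    {'temp_weeklyMean': [1.0, 2.5, 4.0]},
    index=pd.MultiIndex.from_tuples([(2012, 1), (2012, 2), (2012, 3)],
                                    names=['year', 'week']))
influenza = pd.DataFrame(
    {'weekly_infections': [9000, 7000]},
    index=pd.MultiIndex.from_tuples([(2012, 1), (2012, 2)],
                                    names=['year', 'week']))

# Join on the shared index levels; 'inner' drops weeks missing from either frame
merged = weather_weekly.join(influenza, how='inner')
print(merged)
```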

In [None]:

def merge_data(weather_df, influenza_df):
    """ 
    Parameters
    --------
    weather_df: weekly weather data frame
    influenza_df: influenza data frame
    
    Returns
    --------
    merged_data: merged data frame that contains both weekly weather observations and prevalence of influenza infections
    """
    # TODO: your changes here
    merged_data = pd.DataFrame()    

    return merged_data

data_merged = merge_data(data_weather_weekly, data_influenza)

## Task 6: Visualization <span style="color:blue">(4 P)</span>

To get a better understanding of the dataset, create visualizations of the merged data set that help to explore the potential relationships between the variables before starting to develop a model.


**Note:** To hand in multiple figures, change the code accordingly (additional files should be named `"%s_%02u.png" % (studentID, fileCount)`).

In [1]:
# TODO: your changes here
fig = plt.figure()
ax = fig.add_axes([1,1,1,1])
plt.plot([1,2])


fig.savefig(studentID+'_01.png', bbox_inches='tight')

## Task 7: Influenza prediction model <span style="color:blue">(11 P)</span>

Build a model using `sklearn` that predicts the number of influenza incidents for the year 2018 based on data from previous years (i.e. withhold all the data available for 2018 from training). 

 - Choose appropriate machine learning algorithm(s) for the problem at hand
 - Make sure your results are reproducible
 - Don't hesitate to go back to previous steps if you notice any data quality issues
 - If your chosen algorithm has specific parameters, explore their effect with different settings using 10-fold cross-validation
 - Experiment with different training/test splits
 - If appropriate, try different scaling approaches (min/max, z-score,..).
 - How well does your model fit when you evaluate it with the test data set?
 - How good are your predictions when you use the actual data available for 2018 as a validation set?
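
The points on reproducibility, scaling and 10-fold cross-validation can be combined as in the following sketch. It uses synthetic data, and the choice of `Ridge`, the parameter grid and the error metric are arbitrary assumptions for illustration rather than a recommended model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for weekly weather features and weekly infections
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = 5000 - 800 * X[:, 0] + rng.normal(scale=200, size=120)

# Hold out the last rows as a test set
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# Scaling lives inside the pipeline, so it is fitted on the training folds only
model = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    model,
    param_grid={'ridge__alpha': [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=10, shuffle=True, random_state=42),  # fixed seed for reproducibility
    scoring='neg_mean_absolute_error',
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # negative mean absolute error on the test set
```

Putting the scaler into the pipeline avoids leaking information from the validation folds into the scaling parameters.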
 

In [None]:
#TODO: your model implementation here

#### Approach and algorithm <span style="color:blue">(2 P)</span>
Motivate your approach and choice of algorithm here:

[T7_Model_Description]

<span style="color:blue">**TODO:**</span> Please replace this text with your answer, but keep the marker above.

[/T7_Model_Description]

#### Findings  <span style="color:blue">(2 P)</span>
Summarize your findings and lessons learned.

[T7_Model_Findings]

<span style="color:blue">**TODO:**</span> Please replace this text with your answer, but keep the marker above.

[/T7_Model_Findings]