# Modeling Dengue
This is an assignment for INFO 370 Core Methods in Data Science (Autumn 2018) at the University of Washington. The task for this project is to measure the **strength of association** between different environmental variables and the number of cases of Dengue. This Jupyter Notebook is a **polished report** of our analysis on Dengue (source of dataset: [link](https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread)).

- Team members: Rhea Chen, Matthew Wong.
- GitHub repo: [link](https://github.com/info370a-au18/a3-rheaxychen)

## Set Up

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns # for visualization
from scipy.stats import ttest_ind # t-tests
from scipy.stats import pearsonr
import statsmodels.formula.api as smf # linear modeling
import statsmodels.api as sm
import matplotlib.pyplot as plt # plotting
import matplotlib
from sklearn import metrics

## Problem Overview
   Dengue fever is the leading cause of illness and death in the tropics and subtropics, regions that contain a little more than one-third of the world’s population. Mild cases of Dengue fever can cause high fever, rash, and muscle and joint pain, while more severe cases can lead to severe bleeding, a sudden drop in blood pressure, and death. Dengue is transmitted to people by mosquitoes, so the disease’s spread spikes yearly when rainfall is optimal for mosquito breeding. In fact, the disease is so closely tied to the mosquito breeding period that Dengue is classified as an epidemic within the tropics and subtropics every breeding season. This also means that environmental factors that promote mosquito life can serve as indicators for the spread of Dengue. Limiting exposure to mosquitoes in at-risk areas is the only known prevention method for the disease, as a vaccine has yet to be developed. Coupled with the disease’s rapid spread, from cases in 9 countries in 1970 to over 100 countries by 2015, this makes Dengue a high priority for many world health organizations. To make our predictions, we were given access to data from a variety of U.S. Federal Government agencies.
   
### Pertinent Variables 
Because dengue is carried by mosquitoes, its transmission dynamics are related to climate variables such as temperature and precipitation. To predict the number of dengue fever cases (the `total_cases` label for each `(city, year, weekofyear)`) from the provided datasets, it’s important to look at the following information ([source](https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/page/82/#features_list)):

#### City and Date Indicators
- There are data for _two cities_, San Juan (`sj`) and Iquitos (`iq`) in the `city` column. 
- The date indicators are `year`, `weekofyear`, and `week_start_date`; `week_start_date` is given in yyyy-mm-dd format.
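
As a minimal sketch of how these indicators might be handled, the snippet below parses `week_start_date` and checks that `(city, year, weekofyear)` identifies each row uniquely. The `sample` frame and its values are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the indicator columns described above.
sample = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "week_start_date": ["1990-04-30", "1990-05-07", "2000-07-01"],
})

# Parse the yyyy-mm-dd strings into proper datetimes.
sample["week_start_date"] = pd.to_datetime(
    sample["week_start_date"], format="%Y-%m-%d"
)

# (city, year, weekofyear) is expected to identify each row uniquely.
key = ["city", "year", "weekofyear"]
has_duplicate_keys = bool(sample.duplicated(subset=key).any())
```

If `has_duplicate_keys` were `True` on the real data, a merge on this key would multiply rows, so the check is worth running before joining features and labels.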

#### NOAA's GHCN daily climate data weather station measurements
- `station_avg_temp_c` – _Average_ temperature
- `station_precip_mm` – _Total_ precipitation

#### PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)
- `precipitation_amt_mm` – _Total_ precipitation

#### NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)
- `reanalysis_sat_precip_amt_mm` – _Total_ precipitation
- `reanalysis_precip_amt_kg_per_m2` – _Total_ precipitation **(? same as the line above)**
- `reanalysis_dew_point_temp_k` – _Mean_ dew point temperature
- `reanalysis_air_temp_k` – _Mean_ air temperature
- `reanalysis_avg_temp_k` – _Average_ air temperature **(? same as the line above)**
- `reanalysis_relative_humidity_percent` – _Mean_ relative humidity
- `reanalysis_specific_humidity_g_per_kg` – _Mean_ specific humidity
   

### Citations:
   1. Centers for Disease Control and Prevention. (n.d.). Retrieved from [link](https://www.cdc.gov/).
   2. Here's the ideal temp for mosquito-borne diseases. (2017, May 05). Retrieved from [link](https://www.futurity.org/climate-change-mosquito-diseases-1420452/).
   

## Data Preparation

(Criteria: _Properly loads, manages, and wrangles data. Creates at least one new variable and deals with missing values._ 10pts)


As you can see, there are a few columns that represent the same data point (e.g. `precipitation`) from different sources. Are there any _substantive_ differences from the sources and why would they all be included?
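
One way to approach this question is to compare the overlapping columns directly. The sketch below uses a hypothetical helper, `compare_columns`, on five rows copied from the preview of the merged data shown later in this report. It only illustrates the check; results on the full dataset may differ.

```python
import numpy as np
import pandas as pd

def compare_columns(df, col_a, col_b):
    """Summarize how closely two numeric columns agree (rows with NaN are skipped)."""
    pair = df[[col_a, col_b]].dropna()
    return {
        "identical": bool(np.allclose(pair[col_a], pair[col_b])),
        "max_abs_diff": float((pair[col_a] - pair[col_b]).abs().max()),
        "pearson_r": float(pair[col_a].corr(pair[col_b])),
    }

# First five preview rows only; not the full dataset.
toy = pd.DataFrame({
    "precipitation_amt_mm": [12.42, 22.82, 34.54, 15.36, 7.52],
    "reanalysis_sat_precip_amt_mm": [12.42, 22.82, 34.54, 15.36, 7.52],
    "station_precip_mm": [16.0, 8.6, 41.4, 4.0, 5.8],
})

satellite_pair = compare_columns(
    toy, "precipitation_amt_mm", "reanalysis_sat_precip_amt_mm"
)
station_pair = compare_columns(toy, "precipitation_amt_mm", "station_precip_mm")
```

In these five rows the two satellite-derived columns agree exactly, while the station measurement differs noticeably. That suggests the station data are a substantively different source.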

In [6]:
# Load the data from .csv file
dengue_features = pd.read_csv("data/dengue_features_train.csv")
dengue_labels = pd.read_csv("data/dengue_labels_train.csv")

In [7]:
# Count how many rows have missing values in each column of `dengue_features`
dengue_features.isna().sum()

city                                       0
year                                       0
weekofyear                                 0
week_start_date                            0
ndvi_ne                                  194
ndvi_nw                                   52
ndvi_se                                   22
ndvi_sw                                   22
precipitation_amt_mm                      13
reanalysis_air_temp_k                     10
reanalysis_avg_temp_k                     10
reanalysis_dew_point_temp_k               10
reanalysis_max_air_temp_k                 10
reanalysis_min_air_temp_k                 10
reanalysis_precip_amt_kg_per_m2           10
reanalysis_relative_humidity_percent      10
reanalysis_sat_precip_amt_mm              13
reanalysis_specific_humidity_g_per_kg     10
reanalysis_tdtr_k                         10
station_avg_temp_c                        43
station_diur_temp_rng_c                   43
station_max_temp_c                        20
station_mi
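
Before dropping rows, it can help to see how much data a blanket `dropna()` would discard. The sketch below uses a hypothetical mini-frame in place of `dengue_features`. It also shows forward-filling within each city as one possible alternative, which is not the approach taken in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for `dengue_features`.
toy = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq"],
    "ndvi_ne": [0.12, np.nan, 0.03, np.nan, 0.20],
    "station_avg_temp_c": [25.4, 26.7, np.nan, 27.5, 28.9],
})

# Share of rows with at least one missing value, overall and per city.
row_has_na = toy.isna().any(axis=1)
share_lost = row_has_na.mean()
share_lost_by_city = row_has_na.groupby(toy["city"]).mean()

# Alternative (an assumption, not the report's method): carry the last
# observation forward within each city instead of dropping the row.
filled = toy.groupby("city").ffill()
remaining_na = int(filled.isna().sum().sum())
```

Forward-filling cannot repair a gap at the start of a city's series, so some missing values may remain afterwards.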

In [3]:
# Remove all rows with NA values
dengue_features = dengue_features.dropna().copy()
# Add a column that averages the three temperature measurements (station value converted from °C to K)
dengue_features["avg_air_temp_k"] = (
    dengue_features["reanalysis_air_temp_k"]
    + dengue_features["reanalysis_avg_temp_k"]
    + (dengue_features["station_avg_temp_c"] + 273.15)
) / 3
# TODO: add an averaged total precipitation column ("avg_total_precipitation")
# Merge features with labels; how='outer' also keeps label rows whose features were dropped above
merged_data = pd.merge(dengue_features, dengue_labels, on=['city', 'weekofyear', 'year'], how='outer')
# Preview the new dataframe
merged_data.head()

| | city | year | weekofyear | week_start_date | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_k | ... | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_air_temp_k | total_cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sj | 1990 | 18 | 1990-04-30 | 0.1226 | 0.103725 | 0.198483 | 0.177617 | 12.42 | 297.572857 | ... | 12.42 | 14.012857 | 2.628571 | 25.442857 | 6.9 | 29.4 | 20.0 | 16.0 | 297.969524 | 4 |
| 1 | sj | 1990 | 19 | 1990-05-07 | 0.1699 | 0.142175 | 0.162357 | 0.155486 | 22.82 | 298.211429 | ... | 22.82 | 15.372857 | 2.371429 | 26.714286 | 6.371429 | 31.7 | 22.2 | 8.6 | 298.839524 | 5 |
| 2 | sj | 1990 | 20 | 1990-05-14 | 0.03225 | 0.172967 | 0.1572 | 0.170843 | 34.54 | 298.781429 | ... | 34.54 | 16.848571 | 2.3 | 26.714286 | 6.485714 | 32.2 | 22.8 | 41.4 | 299.174762 | 4 |
| 3 | sj | 1990 | 21 | 1990-05-21 | 0.128633 | 0.245067 | 0.227557 | 0.235886 | 15.36 | 298.987143 | ... | 15.36 | 16.672857 | 2.428571 | 27.471429 | 6.771429 | 33.3 | 23.3 | 4.0 | 299.612381 | 3 |
| 4 | sj | 1990 | 22 | 1990-05-28 | 0.1962 | 0.2622 | 0.2512 | 0.24734 | 7.52 | 299.518571 | ... | 7.52 | 17.21 | 3.014286 | 28.942857 | 9.371429 | 35.0 | 23.9 | 5.8 | 300.425238 | 6 |
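
Because rows with missing values were dropped from the features before merging, it may be worth checking which weeks no longer line up with the labels. The sketch below uses hypothetical `features` and `labels` frames and the `indicator=True` option of `pd.merge` to flag unmatched rows.

```python
import pandas as pd

# Hypothetical features (week 19 already dropped for missing values) and labels.
features = pd.DataFrame({
    "city": ["sj", "sj"],
    "year": [1990, 1990],
    "weekofyear": [18, 20],
    "station_avg_temp_c": [25.4, 26.7],
})
labels = pd.DataFrame({
    "city": ["sj", "sj", "sj"],
    "year": [1990, 1990, 1990],
    "weekofyear": [18, 19, 20],
    "total_cases": [4, 5, 4],
})

key = ["city", "year", "weekofyear"]

# indicator=True adds a `_merge` column recording where each row came from.
outer = pd.merge(features, labels, on=key, how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

# An inner merge keeps only the weeks present in both frames.
inner = pd.merge(features, labels, on=key, how="inner")
```

With `how="outer"`, the unmatched label rows come back with `NaN` feature values. An inner merge avoids this, at the cost of losing those weeks' case counts.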


## Exploratory Data Analysis

(Criteria: _Creates at least 5 well designed and clearly titled/labeled visualizations to explore the data, including univariate and multivariate explorations. Insights from the visualizations are clearly documented._ 20pts)

Below are 5 exploratory visualizations that expose pertinent information about the dataset.

1. What is the distribution of the number of cases of Dengue each week?

> The above scatterplot shows a **moderate**, **positive**, **non-linear** association between _week of year_ and the number of _total cases_ for both _cities_. <TODO> Check whether there are any outliers in the San Juan (`sj`) data.
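
The weekly pattern behind such a plot can also be summarized numerically. The sketch below uses a hypothetical frame shaped like `merged_data` with made-up counts. It computes the mean number of cases for each week of the year per city, plus per-city summary statistics.

```python
import pandas as pd

# Hypothetical weekly counts; the real data has one row per (city, year, weekofyear).
toy = pd.DataFrame({
    "city": ["sj"] * 4 + ["iq"] * 4,
    "year": [1990, 1990, 1991, 1991] * 2,
    "weekofyear": [18, 19, 18, 19] * 2,
    "total_cases": [4, 5, 6, 9, 0, 1, 2, 1],
})

# Mean cases for each week of the year, one column per city.
weekly_profile = toy.pivot_table(
    index="weekofyear", columns="city", values="total_cases", aggfunc="mean"
)

# Univariate summary of the weekly counts for each city.
summary = toy.groupby("city")["total_cases"].describe()
```

Calling `weekly_profile.plot()` on the real data would draw one seasonal curve per city. The `max` and `75%` columns of `summary` give a quick first look at potential outliers.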

2. How does the number of cases fluctuate over time? Do these temporal relationships persist in both locations?

> 
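
A possible starting point for this question is to smooth the weekly counts and lay them out with one column per city. The sketch below assumes a frame with `city`, `week_start_date` and `total_cases` columns. The values are hypothetical, and the two-week window is chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical weekly counts for both cities.
toy = pd.DataFrame({
    "city": ["sj"] * 4 + ["iq"] * 4,
    "week_start_date": pd.to_datetime(
        ["1990-04-30", "1990-05-07", "1990-05-14", "1990-05-21"] * 2
    ),
    "total_cases": [4, 5, 4, 3, 0, 0, 1, 3],
})

toy = toy.sort_values(["city", "week_start_date"])

# Two-week rolling mean of cases, computed separately for each city.
toy["cases_smooth"] = toy.groupby("city")["total_cases"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)

# Wide layout: one column per city, indexed by date, ready for plotting.
wide = toy.pivot(index="week_start_date", columns="city", values="cases_smooth")
```

On the real data a longer window (for example, several weeks) would likely be more informative, and `wide.plot()` would show whether the two cities rise and fall together.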

3. Which variables are correlated with your outcome of interest (total_cases)? Are these correlations consistent in both cities (you may want to calculate this)?

> 
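
One way to calculate this is to compute the Pearson correlation between each feature and `total_cases` separately for each city. The sketch below uses deliberately contrived, hypothetical values, chosen so that temperature is perfectly correlated with cases in one city and perfectly anti-correlated in the other. The `features` list is an assumption, not a recommended selection.

```python
import pandas as pd

# Contrived, hypothetical data: not taken from the real dataset.
toy = pd.DataFrame({
    "city": ["sj"] * 5 + ["iq"] * 5,
    "station_avg_temp_c": [25.0, 26.0, 27.0, 28.0, 29.0,
                           27.0, 27.5, 28.0, 28.5, 29.0],
    "station_precip_mm": [16.0, 8.6, 41.4, 4.0, 5.8,
                          30.0, 10.0, 20.0, 5.0, 15.0],
    "total_cases": [4, 5, 6, 7, 8, 5, 4, 3, 2, 1],
})

features = ["station_avg_temp_c", "station_precip_mm"]

# Pearson correlation of each feature with total_cases, one row per city.
corr_by_city = pd.DataFrame({
    city: group[features].corrwith(group["total_cases"])
    for city, group in toy.groupby("city")
}).T
```

Comparing the rows of `corr_by_city` shows at a glance whether a variable's association with `total_cases` has the same sign and rough strength in both cities.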

## Statistical Modeling



## Interpretation