# Week 02 Assignment: glucose level data

Welcome to week two of the course Programming 1. This week you will learn time-related data wrangling with pandas and visualization with bokeh. The focus is on missing data: you will preprocess the glucose JSON file, using interpolation to impute the missing values, so that you can conduct a visual analysis. Learning outcomes:

- load a json dataset
- typecast the Pandas DataFrame to appropriate data types
- inspect the dataset for quality and metadata information
- add a column with interpolated data in Pandas DataFrame
- perform visual analysis

The assignment consists of 6 parts:

- [part 1: load the data](#0)
     - [Exercise 1.1](#ex-11)
- [part 2: prepare for inspection](#1)
     - [Exercise 2.1](#ex-21)
- [part 3: inspect the data](#2)
     - [Exercise 3.1](#ex-31)
- [part 4: interpolate the data](#3)
     - [Exercise 4.1](#ex-41)
- [part 5: visualize the data](#4)
     - [Exercise 5.1](#ex-51)
- [part 6: Challenge](#6)
     - [Exercise 6.1](#ex-61)

Parts 1 and 5 are mandatory; part 6 is optional (bonus).
To pass the assignment you need a score of 60%.


In [192]:
# IMPORTS
import pandas as pd
import numpy as np
import json

<a name='0'></a>
## Part 1: Load the data

Instructions: Load the json data file `glucose.json` into a pandas DataFrame. Check your DataFrame with `.head()` and compare it with the expected outcome.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>json.load() method reads a file, pd.read_json converts it to a Pandas DataFrame</li>
    <li>when loading into a Pandas DataFrame use records orientation </li>
</ul>
</details>
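
The hint about records orientation can be illustrated with a small sketch. The sample below is hypothetical and only mimics the column names of `glucose.json`; the real file may be structured differently. It shows how a list of records (one JSON object per row) is read into a DataFrame.

```python
import io
import json

import pandas as pd

# Hypothetical sample in records orientation: one JSON object per row.
# A JSON null becomes NaN in the resulting DataFrame.
records = [
    {"ID": 2845, "time": "2019-04-25 00:08", "recordtype": 1, "glucose": 109.0},
    {"ID": 2850, "time": "2019-04-25 00:50", "recordtype": 1, "glucose": None},
]
raw = json.dumps(records)

# read_json expects a path or a file-like object, so wrap the string in StringIO
df_sample = pd.read_json(io.StringIO(raw), orient="records")
print(df_sample.head())
```

Each key of the JSON objects becomes a column and each object becomes a row.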

<a name='ex-11'></a>
### Code your solution

In [193]:
# CODE YOUR SOLUTION HERE

# Open and load the json file; the context manager closes the file automatically
with open("../data/glucose.json") as file:
    json_file = json.load(file)

# Read the loaded json into a DataFrame
df_glucose = pd.read_json(json_file)
df_glucose.head()

| | ID | time | recordtype | glucose |
|---|---|---|---|---|
| 0 | 2845.0 | 2019-04-25 00:08 | 1 | 109.0 |
| 1 | 2850.0 | 2019-04-25 00:50 | 1 | |
| 2 | 2877.0 | 2019-04-25 07:02 | 1 | 123.0 |
| 3 | 2881.0 | 2019-04-25 07:34 | 1 | 158.0 |
| 4 | 2886.0 | 2019-04-25 08:19 | 1 | |


#### Expected outcome: 

<a name='1'></a>
## Part 2: Prepare the data

Check the data types of your DataFrame. The `glucose` field should be an integer and the `time` field should have a datetime format. If the data types are different, typecast them to the right format.
Make sure that your dataset is sorted by the time column.


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use astype() method or pandas.to_datetime() for instance</li>
    <li>make sure that the empty spaces are filled with NaN. Use errors='coerce'</li>
    <li>set_index(), sort_index() and reset_index() are helpful to sort on index</li>
</ul>
</details>

<a name='ex-21'></a>
### Code your solution

In [195]:
# CODE YOUR SOLUTION HERE
# Cast ID to int64. Caution: rows without an ID (NaN) have no integer
# representation. In the run shown here the cast produced a sentinel value;
# newer pandas versions raise an error instead.
df_glucose["ID"] = df_glucose["ID"].astype("int64")
# Parse the time column as datetime64[ns]
df_glucose["time"] = pd.to_datetime(df_glucose["time"])
# Convert to a numeric dtype; values that cannot be parsed become NaN
df_glucose["glucose"] = pd.to_numeric(df_glucose["glucose"], errors="coerce").astype("float64")

# Sort chronologically on the time column and reset the index
df_glucose = df_glucose.sort_values("time").reset_index(drop=True)

df_glucose.dtypes

ID                     int64
time          datetime64[ns]
recordtype             int64
glucose              float64
dtype: object
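
The typecasting hints can be sketched on a few hypothetical rows. This is an illustration under assumptions, not the reference solution: the sample values are made up, and the nullable `Int64` dtype is swapped in for the plain `int64` cast used above because it can hold missing IDs.

```python
import pandas as pd

# Hypothetical rows: unsorted, with an empty glucose string and a missing ID
df_sample = pd.DataFrame({
    "ID": [2850.0, None, 2845.0],
    "time": ["2019-04-25 00:50", "2019-04-25 00:14", "2019-04-25 00:08"],
    "glucose": ["", "", "109"],
})

# Unparseable values become NaT / NaN instead of raising an error
df_sample["time"] = pd.to_datetime(df_sample["time"], errors="coerce")
df_sample["glucose"] = pd.to_numeric(df_sample["glucose"], errors="coerce")

# Nullable integer dtype: missing IDs stay missing (<NA>)
df_sample["ID"] = df_sample["ID"].astype("Int64")

# Sort chronologically and rebuild the index
df_sample = df_sample.sort_values("time").reset_index(drop=True)
print(df_sample.dtypes)
```

With the nullable dtype the missing ID shows up as `<NA>`, so it is still counted by `isna()`.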

#### Expected outcome: 

<a name='2'></a>
## Part 3: Inspect the data

Now that the data is prepared, inspect it to get more familiar with it. You are required to do the following:

- inspect the percentage of missing data for glucose
- what is the relationship between recordtype and glucose value?
- what is the relationship between ID and glucose value?

Code the solutions to your answers. Create meaningful overviews or statistics.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>In the week 01 assignment some functions were explained to inspect missing values</li>
    <li>In the week 01 assignment some functions were explained to groupby value</li>
</ul>
</details>

<a name='ex-31'></a>
### Code your solution

In [196]:
#CODE YOUR SOLUTION HERE

n_missing_values = df_glucose["glucose"].isnull().sum() # number missing values glucose column
n_total_values = df_glucose["glucose"].shape[0] # number of total values
percentage_missing = n_missing_values / n_total_values
print(f"Percentage missing data: {percentage_missing}\n")


# Relation between recordtype and glucose
rec_0 = df_glucose[df_glucose["recordtype"] == 0][["ID", "glucose"]]

n_null_rec0 = rec_0.glucose.isnull().sum()

rec_1 = df_glucose[df_glucose["recordtype"] == 1][["ID", "glucose"]]
n_null_rec1 = rec_1.glucose.isnull().sum()

print(f"Number of missing glucose values for recordtype 0: {n_null_rec0}, recordtype 1: {n_null_rec1}")
print(f"All glucose values of recordtype 0 are missing: {rec_0.glucose.isnull().all()}\n")

# Relation between ID and glucose value
id_glucose_df = df_glucose.loc[:, ["ID", "glucose"]]
ID_miss = id_glucose_df[id_glucose_df.glucose.isnull()].ID
ID_normal = id_glucose_df[id_glucose_df.glucose.notnull()].ID

n_unique_ID_miss = ID_miss.nunique()
n_unique_ID_normal = ID_normal.nunique()

print(f"The number of unique IDs for the IDs that have missing glucose values are: {n_unique_ID_miss}")
print(f"The number of unique IDs for the IDs that have NORMAL glucose values are: {n_unique_ID_normal}\n")

# Although there are 84 missing values, only 3 IDs are coupled to these rows.
# One of these IDs looks rather odd and is used multiple times.

print(ID_miss.value_counts()) # Shows that one ID is connected to multiple rows

Percentage missing data: 0.6176470588235294

Number of missing glucose values for recordtype 0: 82, recordtype 1: 2
All glucose values of recordtype 0 are missing: True

The number of unique IDs for the IDs that have missing glucose values are: 3
The number of unique IDs for the IDs that have NORMAL glucose values are: 52

-9223372036854775808    82
 2850                    1
 2886                    1
Name: ID, dtype: int64


In summary, all 82 rows with recordtype 0 have a missing glucose value, whereas only 2 of the 54 rows with recordtype 1 do. This might indicate that rows with recordtype 0 can be removed altogether. There is also something suspicious about the relation between the ID and glucose columns. Rows with missing glucose values are connected to only three different IDs: 2850, 2886 and -9223372036854775808. The IDs 2850 and 2886 each occur once, while the third occurs 82 times. That value is most likely an artifact of converting the ID column from float to int64: a missing ID (NaN) has no integer representation, and -9223372036854775808 is the smallest value an int64 can hold, which is what the cast produced for each of those rows. So the 82 repetitions are rows without an ID, not one very large ID that is reused. All rows that contain normal glucose values have a unique ID ranging from 2845 to 3062.
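
The missing-value share per group can also be computed in one step. The sketch below uses a hypothetical five-row sample, not the assignment data; it relies on the fact that the mean of a boolean mask equals the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: recordtype 0 rows never carry a glucose value
df_sample = pd.DataFrame({
    "recordtype": [1, 0, 0, 1, 1],
    "glucose": [109.0, np.nan, np.nan, np.nan, 123.0],
})

# Overall fraction of missing glucose values
share_missing = df_sample["glucose"].isna().mean()

# Fraction of missing glucose values per recordtype
missing_by_type = df_sample["glucose"].isna().groupby(df_sample["recordtype"]).mean()

print(share_missing)
print(missing_by_type)
```

A share of 1.0 for a group means that every row in that group lacks a glucose value.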

#### Expected outcome percentage missing data
0.6176470588235294

<a name='3'></a>
## Part 4: Interpolate the data

A lot of data is missing. Use interpolation to fill the missing values and store the result in a new column. Select an interpolation method that suits the nature of the data and argue your choice. Note that your interpolated values may differ from the example below.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use pandas.DataFrame.interpolate() method</li>
</ul>
</details>

<a name='ex-41'></a>
### Code your solution

In [197]:
#CODE YOUR SOLUTION HERE
# Interpolate the missing data from the glucose column and assign it to a new column
df_glucose["interpolated"] = df_glucose['glucose'].interpolate(method = 'linear', limit_direction = 'forward')

df_glucose.head()

| | ID | time | recordtype | glucose | interpolated |
|---|---|---|---|---|---|
| 0 | 2845 | 2019-04-25 00:08:00 | 1 | 109.0 | 109.0 |
| 1 | -9223372036854775808 | 2019-04-25 00:14:00 | 0 | | 109.466667 |
| 2 | -9223372036854775808 | 2019-04-25 00:29:00 | 0 | | 109.933333 |
| 3 | -9223372036854775808 | 2019-04-25 00:44:00 | 0 | | 110.4 |
| 4 | 2850 | 2019-04-25 00:50:00 | 1 | | 110.866667 |


The 'linear' method was chosen for interpolating because it fills each gap along a straight line between the nearest known value before it and the nearest known value after it. Note that this method treats the rows as equally spaced and ignores the timestamps. The `limit_direction` argument was set to 'forward' because the first value is known and can thus be used as the starting point. The `limit` argument was left at its default (None), because there are multiple NaN values in a row that we want to interpolate.
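
Because the measurements are not equally spaced in time, the difference between position-based and time-weighted interpolation is worth seeing on a small case. The sketch below is an illustration with hypothetical values, not the method used in the solution above; `method="time"` requires a `DatetimeIndex`.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements at irregular intervals (0, 10, 40 and 60 minutes)
times = pd.to_datetime([
    "2019-04-25 00:00",
    "2019-04-25 00:10",
    "2019-04-25 00:40",
    "2019-04-25 01:00",
])
s = pd.Series([100.0, np.nan, np.nan, 160.0], index=times)

# 'linear' ignores the index and spaces the values evenly by position
by_position = s.interpolate(method="linear")

# 'time' weights each gap by the elapsed time between timestamps
by_time = s.interpolate(method="time")

print(pd.DataFrame({"linear": by_position, "time": by_time}))
```

In this sample the value at 00:10 is 120 by position but 110 when weighted by time, since only one sixth of the hour has passed.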

#### Example outcome

<a name='4'></a>
## Part 5: Plot the data

Create a plot with the original data and the interpolated data. Consider what the best representation is for visualising actual values alongside modelled/imputed values. An example of such a plot is given below. This plot, however, is not considered best practice.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>figure(x_axis_type='datetime') automatically makes nice labels of the datetime data</li>
</ul>
</details>

<a name='ex-51'></a>
### Code your solution

In [198]:
from bokeh.io import output_notebook
from bokeh.layouts import gridplot, column
from bokeh.models import DatetimeTickFormatter
from bokeh.plotting import ColumnDataSource, figure, show
output_notebook()

In [199]:
# CODE YOUR SOLUTION HERE
# width/height replace plot_width/plot_height, which were removed in Bokeh 3.0
p = figure(width = 500, height = 400, tools="pan, hover", x_axis_type='datetime')

p.line(df_glucose['time'], df_glucose['interpolated'], legend_label = "Interpolated data")
p.scatter(df_glucose['time'], df_glucose['glucose'], color = "red", marker = "asterisk", size = 7, legend_label = "Glucose values")
p.xaxis.axis_label = 'Date/Time (hour)'
p.yaxis.axis_label = '(interpolated) Glucose levels (mmol/L)'
p.xaxis.formatter=DatetimeTickFormatter(days="%m/%d",
    months="%m/%d %H:%M",
    hours="%H:%M",
    minutes="%H:%M"
)
p.legend.location = "top_left"
p.legend.click_policy="hide"
show(p)
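
To make the distinction between measured and imputed values explicit in a plot, it can help to flag the imputed rows first. The sketch below is one possible approach on hypothetical values; the column name `is_imputed` is an assumption introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: measured values plus a linearly interpolated column
df_sample = pd.DataFrame({"glucose": [109.0, np.nan, np.nan, 123.0]})
df_sample["interpolated"] = df_sample["glucose"].interpolate(method="linear")

# True where the value was imputed rather than measured
df_sample["is_imputed"] = df_sample["glucose"].isna() & df_sample["interpolated"].notna()

# Split the rows so each subset can get its own glyph style
measured = df_sample[~df_sample["is_imputed"]]
imputed = df_sample[df_sample["is_imputed"]]

print(df_sample)
```

The two subsets could then be drawn with separate glyph calls, for example solid markers for `measured` and a lighter marker or dashed line for `imputed`, so that a reader never mistakes an estimate for a measurement.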

<a name='6'></a>
## Part 6: Challenge

It might even be interesting to introduce a widget in which you can select different methods to interpolate.
1. Can you improve the interpolation by choosing another method?
2. Can you add a rolling mean line?
3. Can you improve the plot by making it interactive?
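
For the rolling mean, the window can be defined either as a number of rows or as a time span. The sketch below compares both on hypothetical, irregularly spaced values; it is an illustration under those assumptions and the 30-minute span is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical values at irregular intervals (0, 10, 40 and 60 minutes)
times = pd.to_datetime([
    "2019-04-25 00:00",
    "2019-04-25 00:10",
    "2019-04-25 00:40",
    "2019-04-25 01:00",
])
s = pd.Series([100.0, 110.0, 140.0, 160.0], index=times)

# Row-based window: mean of the current and the previous row;
# min_periods=1 keeps the first row instead of returning NaN
by_rows = s.rolling(window=2, min_periods=1).mean()

# Time-based window: mean of all values in the 30 minutes up to each timestamp
by_span = s.rolling("30min").mean()

print(pd.DataFrame({"rows": by_rows, "30min": by_span}))
```

The two results differ at 00:40: the row-based window averages 110 and 140, while the time-based window contains only the value at 00:40 because the previous measurement is exactly 30 minutes old and falls outside the half-open window.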

<a name='ex-61'></a>
### Code your solution

In [200]:
# perform pchip interpolation
pchip_interpolation = df_glucose['glucose'].interpolate(method = 'pchip', limit_direction = 'forward')

In [201]:
#CODE YOUR SOLUTION HERE
from bokeh.models import Tabs
try:
    from bokeh.models import TabPanel as Panel  # Bokeh 3.0 and later
except ImportError:
    from bokeh.models import Panel  # older Bokeh versions

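# Create a rolling mean for both interpolation methods
# (window of 2 rows; min_periods=1 keeps the first row instead of NaN)
roll_mean_linear = df_glucose.interpolated.rolling(window=2, min_periods=1).mean()
roll_mean_pchip = pchip_interpolation.rolling(window=2, min_periods=1).mean()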

def create_tab(x, y, y2, roll_mean, label):
    """
    Create a panel plot.
    
    :parameters
    -----------
    x - array-like
        The x axis array
    y - array-like
        The y axis array
    y2 - array like
        Array used for scatter plot
    roll_mean - array-like
        Array of the rolling mean
    label - String
        Description of the interpolation method
        
    :returns
    --------
    tab - Panel object
    """
    # width/height replace plot_width/plot_height, which were removed in Bokeh 3.0
    p = figure(title = "Glucose values over time.", width = 750, height = 400, tools="pan, hover", x_axis_type='datetime')

    line = p.line(x, y, legend_label = label)
    line2 = p.line(x, roll_mean, legend_label = "Rolling mean", color = "black", line_alpha = 0.5, line_dash = "dashdot", line_join = "miter")
    points = p.scatter(x, y2, color = "red", marker = "asterisk", size = 7, legend_label = "Glucose values")
    
    p.xaxis.axis_label = 'Date/Time (hour)'
    p.yaxis.axis_label = '(interpolated) Glucose levels (mmol/L)'
    p.xaxis.formatter=DatetimeTickFormatter(days="%m/%d",
        months="%m/%d %H:%M",
        hours="%H:%M",
        minutes="%H:%M"
    )
    p.legend.location = "top_left"
    p.legend.click_policy="hide"
    
    tab = Panel(child=p, title=label)
    return tab

tab1 = create_tab(df_glucose['time'], df_glucose['interpolated'], df_glucose['glucose'], roll_mean_linear, "Interpolated: linear")
tab2 = create_tab(df_glucose['time'], pchip_interpolation, df_glucose['glucose'], roll_mean_pchip, "Interpolated: pchip")

# Plot
show(Tabs(tabs=[tab1, tab2]))

The pchip interpolation method seems to work better than the linear method. Its piecewise-cubic curve passes smoothly through the measured points instead of changing direction abruptly at each one, which gives a more natural flow.