# Project description

This dataset was provided by Una Health.<br> 
The main objective of this data challenge is to visualise blood glucose levels and find
connections between the blood glucose levels and tracked meals. <br>
<br>

### Data
The data consists of:
- A csv file with the historic blood glucose levels of the patient: `levels_all.csv`. We're interested in blood glucose readings which have an `Aufzeichnungstyp` of either 0 or 1. These appear in different columns because they are different types (automatically collected every 15 minutes vs. manually scanned by the patient), but they are readings from the same sensor and thus should be treated as such. The blood glucose reading is noted in `Glukosewert-Verlauf mg/dL` or `Glukose-Scan mg/dL`. All timestamps noted in this file are UTC.
- A csv file with the tracked meals of the patient: `activities_all.csv`. Each meal is identified by a UUID.

You can download the sample data from here: https://s3-de-central.profitbricks.com/una-health-data-challenge/una-health-data-challenge.zip

### Tasks

##### Visualise 
Create plots for the historic blood glucose level by each patient and the historic blood glucose level after meals (we usually look at timestamp_start of the meal + 3 hours worth of data). Feel free to group and slice the data for the individual meals as you see fit.

##### Interpret 
What conclusions can you draw when looking at the combined data of historic blood glucose levels and tracked meals for an individual patient and for certain meal types of an individual patient? What clusters (if any) do you find? 
What additional information (if any) do you need? What clustering methods would you apply?

##### Evaluation Criteria
General: selected and explained chosen software design and libraries, used clean code structure, followed and explained coding conventions
<br>
Algorithms and data structures: explained the approach and thinking and methods used, clarified variety and suitability of methods and algorithms used, explained selected data structure
<br>
Documentation: added clear inline & high-level documentation, added documentation on how to start & run the submitted solution
<br>
We evaluate your communication with us

# Instructions to set-up environment

To run this notebook without issues, please create the conda environment from the provided `UnaHealthenvironment.yml` file. Creating the environment from this file should let you reproduce the results shown in this notebook.

Please run the following commands in the Anaconda prompt:
- Create the environment: `conda env create -f <path to environment.yml>`
- Activate the created environment: `conda activate ml`
- Start the Jupyter notebook server: `jupyter notebook`

Further information can be found here: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html

# Import packages

In [322]:
# Standard library
import os
import random
import re
from datetime import date, datetime
from io import BytesIO
from math import ceil
from pathlib import Path
from zipfile import ZipFile

# Third-party (duplicate imports removed)
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.gridspec as gspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import requests
import scipy.stats as stats
import seaborn as sns
import sklearn.covariance as cov
import sklearn.impute as imp
import sklearn.preprocessing as prep
from joblib import dump
from numpy.linalg import svd
from scipy.cluster.hierarchy import dendrogram, linkage, set_link_color_palette
from scipy.spatial import distance_matrix
from scipy.stats import chi2
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MeanShift, estimate_bandwidth
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.metrics import (classification_report, confusion_matrix, precision_score,
                             silhouette_samples, silhouette_score)
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_graphviz, plot_tree
from yellowbrick.cluster import KElbowVisualizer

%matplotlib inline
%config InlineBackend.figure_format = 'retina' #better quality of visualisations

import warnings
warnings.filterwarnings('ignore')



# Import Data

In [323]:
#Function to create customers based on the id logic
def create_customer(x):
    return 8*x+'-'+4*x+'-'+4*x+'-'+4*x+'-'+12*x       

In [324]:
#Put customers in a list
list_customers = []

for character in ['a', 'b', 'c']:
    this_customer = create_customer(character)
    list_customers.append(this_customer)    

In [325]:
#Connect to the url and unzip the file
r = requests.get("https://s3-de-central.profitbricks.com/una-health-data-challenge/una-health-data-challenge.zip")
files = ZipFile(BytesIO(r.content))



In [326]:
#Go through each customer and read the respective csv 
df_activities = pd.read_csv(files.open(list_customers[0]+'/activities_all.csv'), encoding='utf-8')
df_levels = pd.read_csv(files.open(list_customers[0]+'/levels_all.csv'), encoding='utf-8', header=1)
df_levels['user_id'] = list_customers[0]

for customer in list_customers[1:]:
    this_customer_activities = pd.read_csv(files.open(customer+'/activities_all.csv'), 
                                           encoding='utf-8')
    this_customer_levels = pd.read_csv(files.open(customer+'/levels_all.csv'), 
                                           encoding='utf-8', header=1)
    this_customer_levels['user_id'] = customer
    df_activities = pd.concat([df_activities, this_customer_activities])
    df_levels = pd.concat([df_levels, this_customer_levels])

In [327]:
df_activities.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_end,payload,created,last_modified
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa00,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_BREAKFAST,"40 g Haferflocken, 230 g Joghurt 0,3 %, 90 g B...",2021-02-15T08:30:00+01:00,,,,
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa01,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_LUNCH,"98 g Möhren-Walnuss-VK-Brot, 87 g Gurke, 55 g ...",2021-02-15T12:45:00+01:00,,,,
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa02,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_SNACK,"Mandarine, Teelöffel Erdnussmuß",2021-02-15T16:15:00+01:00,,,,
3,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa03,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,ACTVITY_EASY,spazieren,2021-02-15T17:00:00+01:00,2021-02-15T17:30:00+01:00,,,
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa04,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_DINNER,"50 g BasmatiVollkorn Reis, 20g Currypaste, 15 ...",2021-02-15T19:30:00+01:00,,,,


In [328]:
df_levels.head()

Unnamed: 0,Gerät,Seriennummer,Gerätezeitstempel,Aufzeichnungstyp,Glukosewert-Verlauf mg/dL,Glukose-Scan mg/dL,Nicht numerisches schnellwirkendes Insulin,Schnellwirkendes Insulin (Einheiten),Nicht numerische Nahrungsdaten,Kohlenhydrate (Gramm),Kohlenhydrate (Portionen),Nicht numerisches Depotinsulin,Depotinsulin (Einheiten),Notizen,Glukose-Teststreifen mg/dL,Keton mmol/L,Mahlzeiteninsulin (Einheiten),Korrekturinsulin (Einheiten),Insulin-Änderung durch Anwender (Einheiten),user_id
0,FreeStyle LibreLink,1D48A10E-DDFB-4888-8158-026F08814832,18-02-2021 10:57,0,77.0,,,,,,,,,,,,,,,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
1,FreeStyle LibreLink,1D48A10E-DDFB-4888-8158-026F08814832,18-02-2021 11:12,0,78.0,,,,,,,,,,,,,,,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
2,FreeStyle LibreLink,1D48A10E-DDFB-4888-8158-026F08814832,18-02-2021 11:27,0,78.0,,,,,,,,,,,,,,,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
3,FreeStyle LibreLink,1D48A10E-DDFB-4888-8158-026F08814832,18-02-2021 11:42,0,76.0,,,,,,,,,,,,,,,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
4,FreeStyle LibreLink,1D48A10E-DDFB-4888-8158-026F08814832,18-02-2021 11:57,0,75.0,,,,,,,,,,,,,,,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa


# Quick assessment

In [329]:
df_activities.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38 entries, 0 to 6
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               38 non-null     object 
 1   user_id          38 non-null     object 
 2   record_type      38 non-null     object 
 3   description      38 non-null     object 
 4   timestamp_start  38 non-null     object 
 5   timestamp_end    2 non-null      object 
 6   payload          0 non-null      float64
 7   created          0 non-null      float64
 8   last_modified    0 non-null      float64
dtypes: float64(3), object(6)
memory usage: 3.0+ KB


In [330]:
df_levels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3991 entries, 0 to 1422
Data columns (total 20 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Gerät                                        3991 non-null   object 
 1   Seriennummer                                 3991 non-null   object 
 2   Gerätezeitstempel                            3991 non-null   object 
 3   Aufzeichnungstyp                             3991 non-null   int64  
 4   Glukosewert-Verlauf mg/dL                    3579 non-null   float64
 5   Glukose-Scan mg/dL                           345 non-null    float64
 6   Nicht numerisches schnellwirkendes Insulin   0 non-null      float64
 7   Schnellwirkendes Insulin (Einheiten)         0 non-null      float64
 8   Nicht numerische Nahrungsdaten               1 non-null      float64
 9   Kohlenhydrate (Gramm)                        0 non-null      float64
 10  

# Data Preprocessing

#### First the activities

In [333]:
df_activities_clean = df_activities.copy()

#keep only rows about meals
activities_to_keep = ['MEAL_DINNER', 'MEAL_LUNCH', 'MEAL_SNACK', 'MEAL_BREAKFAST']
df_activities_clean = df_activities_clean[df_activities_clean['record_type'].isin(activities_to_keep)]

#remove empty columns
df_activities_clean = df_activities_clean.iloc[:,:-4] 

#format datetime as timestamp (drops the '+01:00' offset and keeps the local wall-clock time)
df_activities_clean['timestamp_start'] = df_activities_clean['timestamp_start'].apply(lambda x: pd.Timestamp(x[:-6]))

#add timestamp for 3h after meal
df_activities_clean['timestamp_aftermeal'] = df_activities_clean['timestamp_start'] + pd.Timedelta(hours=3)


df_activities_clean.head()


Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa00,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_BREAKFAST,"40 g Haferflocken, 230 g Joghurt 0,3 %, 90 g B...",2021-02-15 08:30:00,2021-02-15 11:30:00
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa01,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_LUNCH,"98 g Möhren-Walnuss-VK-Brot, 87 g Gurke, 55 g ...",2021-02-15 12:45:00,2021-02-15 15:45:00
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa02,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_SNACK,"Mandarine, Teelöffel Erdnussmuß",2021-02-15 16:15:00,2021-02-15 19:15:00
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa04,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_DINNER,"50 g BasmatiVollkorn Reis, 20g Currypaste, 15 ...",2021-02-15 19:30:00,2021-02-15 22:30:00
5,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa05,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_BREAKFAST,"230 g Joghurt, 40g Haferflocken, 65 g Apfel",2021-02-17 08:15:00,2021-02-17 11:15:00
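
The task description states that the timestamps in `levels_all.csv` are UTC, while the meal timestamps carry a `+01:00` offset that the cell above drops rather than converts. If the two series turn out to need the same clock, a minimal sketch of the conversion could look like this. The toy values and the `*_utc` column names are assumptions for illustration, and whether the device timestamps really are UTC should be checked against the data first.

```python
import pandas as pd

# Toy meal timestamps in the same ISO 8601 format as activities_all.csv
meals = pd.DataFrame({
    "timestamp_start": ["2021-02-15T08:30:00+01:00", "2021-02-15T12:45:00+01:00"],
})

# Parse with the offset, convert to UTC, then drop the timezone information
# so the result compares cleanly with naive UTC timestamps
meals["timestamp_start_utc"] = (
    pd.to_datetime(meals["timestamp_start"], utc=True).dt.tz_localize(None)
)

# Same 3-hour window as in the cell above, now on the UTC clock
meals["timestamp_aftermeal_utc"] = meals["timestamp_start_utc"] + pd.Timedelta(hours=3)
```

With this variant a meal logged at 08:30 local time would be matched against readings from 07:30 UTC onwards, so every before/after value computed further down would shift accordingly.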


#### Then the levels

In [334]:
df_levels_clean = df_levels.copy()

#only Aufzeichnungstyp of 0 or 1
df_levels_clean = df_levels_clean[df_levels_clean['Aufzeichnungstyp'].isin([0,1])]

#Combine the info from verlauf and scan
df_levels_clean['Glukosewert-Verlauf mg/dL'] = df_levels_clean['Glukosewert-Verlauf mg/dL'].fillna(df_levels_clean['Glukose-Scan mg/dL'])

#Keep only columns with info
df_levels_clean = df_levels_clean[['user_id', 'Gerätezeitstempel', 'Aufzeichnungstyp', 'Glukosewert-Verlauf mg/dL']]

#Recode datetime as timestamp (device format is day-first, DD-MM-YYYY HH:MM, so parse it explicitly as such)
df_levels_clean['timestamp'] = pd.to_datetime(df_levels_clean['Gerätezeitstempel'], dayfirst=True)

#Keep only the columns we are interested in
df_levels_clean = df_levels_clean[['user_id', 'timestamp', 'Aufzeichnungstyp', 'Glukosewert-Verlauf mg/dL']]

df_levels_clean

Unnamed: 0,user_id,timestamp,Aufzeichnungstyp,Glukosewert-Verlauf mg/dL
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 10:57:00,0,77.0
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:12:00,0,78.0
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:27:00,0,78.0
3,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:42:00,0,76.0
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:57:00,0,75.0
...,...,...,...,...
1407,cccccccc-cccc-cccc-cccc-cccccccccccc,2021-02-23 20:18:00,1,106.0
1408,cccccccc-cccc-cccc-cccc-cccccccccccc,2021-02-23 21:29:00,1,108.0
1409,cccccccc-cccc-cccc-cccc-cccccccccccc,2021-02-24 04:12:00,1,113.0
1410,cccccccc-cccc-cccc-cccc-cccccccccccc,2021-02-24 07:17:00,1,134.0


Now we want to merge glucose levels information to the meals dataframe

In [335]:
#Function to search for glucose level prior to meal start

def get_glucose_before_meal_a(x):
    #Search for this user's glucose levels in df_levels and sort by time
    df_to_search = df_levels_clean[df_levels_clean['user_id']==create_customer('a')]
    df_to_search = df_to_search.sort_values(by='timestamp')
    #Return the last glucose measurement prior to the meal start
    return df_to_search[df_to_search['timestamp']<x].iloc[-1:,]['Glukosewert-Verlauf mg/dL'].iloc[0]

#Bad solution (should be integrated in the function but time was running out! and I only noticed in the end...)
def get_glucose_before_meal_b(x):
    df_to_search = df_levels_clean[df_levels_clean['user_id']==create_customer('b')]
    df_to_search = df_to_search.sort_values(by='timestamp')
    return df_to_search[df_to_search['timestamp']<x].iloc[-1:,]['Glukosewert-Verlauf mg/dL'].iloc[0]

def get_glucose_before_meal_c(x):
    df_to_search = df_levels_clean[df_levels_clean['user_id']==create_customer('c')]
    df_to_search = df_to_search.sort_values(by='timestamp')
    return df_to_search[df_to_search['timestamp']<x].iloc[-1:,]['Glukosewert-Verlauf mg/dL'].iloc[0]
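
The comment in the cell above notes that the three per-user copies should really be one function. One possible way to do that, sketched here on toy data, is `pd.merge_asof`, which looks up the last reading strictly before each meal for all users at once. The frames below only stand in for `df_levels_clean` and `df_activities_clean`; the short user ids, the `glucose` column name and the values are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for df_levels_clean and df_activities_clean
levels = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2021-02-15 08:00", "2021-02-15 08:15", "2021-02-15 08:45",
        "2021-02-15 08:10", "2021-02-15 08:25",
    ]),
    "glucose": [78.0, 80.0, 95.0, 90.0, 92.0],
})
meals = pd.DataFrame({
    "user_id": ["a", "b"],
    "timestamp_start": pd.to_datetime(["2021-02-15 08:30", "2021-02-15 08:20"]),
})

# merge_asof requires both frames to be sorted by their time key
levels = levels.sort_values("timestamp")
meals = meals.sort_values("timestamp_start")

meals_with_start = pd.merge_asof(
    meals,
    levels.rename(columns={"glucose": "glucose_start"}),
    left_on="timestamp_start",
    right_on="timestamp",
    by="user_id",                # only match readings of the same user
    direction="backward",        # take the last reading before the meal
    allow_exact_matches=False,   # strictly before, like `timestamp < x`
)
```

Unlike the per-user functions, this returns `NaN` instead of raising an error when a user has no reading before a meal.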

In [336]:
#Function to look for the max of glucose level in the 3h following meal start

def get_glucose_after_meal_a(x):
    #Search for this user's glucose levels in df_levels and sort by time
    df_to_search = df_levels_clean[(df_levels_clean['user_id']==create_customer('a')) & (df_levels_clean['timestamp']>x-pd.Timedelta(hours=3))]
    df_to_search = df_to_search.sort_values(by='timestamp')
    #Return the highest value found in that interval
    return df_to_search[df_to_search['timestamp']<x]['Glukosewert-Verlauf mg/dL'].max()

#Bad solution (should be integrated in the function but time was running out! and I only noticed in the end...)
def get_glucose_after_meal_b(x):
    df_to_search = df_levels_clean[(df_levels_clean['user_id']==create_customer('b')) & (df_levels_clean['timestamp']>x-pd.Timedelta(hours=3))]
    df_to_search = df_to_search.sort_values(by='timestamp')
    return df_to_search[df_to_search['timestamp']<x]['Glukosewert-Verlauf mg/dL'].max()

def get_glucose_after_meal_c(x):
    df_to_search = df_levels_clean[(df_levels_clean['user_id']==create_customer('c')) & (df_levels_clean['timestamp']>x-pd.Timedelta(hours=3))]
    df_to_search = df_to_search.sort_values(by='timestamp')
    return df_to_search[df_to_search['timestamp']<x]['Glukosewert-Verlauf mg/dL'].max()

In [377]:
#Function to look for the time at which the glucose was at the highest level in the 3h following the meal

def get_glucose_after_meal_spike_time_a(x):
    #Search for this user's glucose levels in df_levels and sort by time
    df_to_search = df_levels_clean[(df_levels_clean['user_id']==create_customer('a')) & (df_levels_clean['timestamp']>x-pd.Timedelta(hours=3))]
    df_to_search = df_to_search.sort_values(by='timestamp')
    #Return the timestamp of the highest value found in that interval
    max_index = df_to_search[df_to_search['timestamp']<x]['Glukosewert-Verlauf mg/dL'].idxmax()
    return df_to_search.loc[max_index]['timestamp']

def get_glucose_after_meal_spike_time_b(x):
    #Search for this user's glucose levels in df_levels and sort by time
    df_to_search = df_levels_clean[(df_levels_clean['user_id']==create_customer('b')) & (df_levels_clean['timestamp']>x-pd.Timedelta(hours=3))]
    df_to_search = df_to_search.sort_values(by='timestamp')
    #Return the timestamp of the highest value found in that interval
    max_index = df_to_search[df_to_search['timestamp']<x]['Glukosewert-Verlauf mg/dL'].idxmax()
    return df_to_search.loc[max_index]['timestamp']

def get_glucose_after_meal_spike_time_c(x):
    #Search for this user's glucose levels in df_levels and sort by time
    df_to_search = df_levels_clean[(df_levels_clean['user_id']==create_customer('c')) & (df_levels_clean['timestamp']>x-pd.Timedelta(hours=3))]
    df_to_search = df_to_search.sort_values(by='timestamp')
    #Return the timestamp of the highest value found in that interval
    max_index = df_to_search[df_to_search['timestamp']<x]['Glukosewert-Verlauf mg/dL'].idxmax()
    return df_to_search.loc[max_index]['timestamp']
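
The six functions above differ only in the user id, and the value and time lookups repeat the same filtering. A minimal sketch of a single parameterised helper is shown below. The name `peak_after_meal`, the `glucose` column and the toy readings are assumptions for illustration, and it takes the meal start (not `timestamp_aftermeal`) as input.

```python
import pandas as pd


def peak_after_meal(levels, user_id, meal_start, hours=3, value_col="glucose"):
    """Return (peak value, peak timestamp) within `hours` after `meal_start`.

    The window is open on both ends, mirroring the `>` and `<` filters above.
    Assumes the index is unique within the selected window.
    """
    window_end = meal_start + pd.Timedelta(hours=hours)
    in_window = levels[
        (levels["user_id"] == user_id)
        & (levels["timestamp"] > meal_start)
        & (levels["timestamp"] < window_end)
    ]
    if in_window.empty:
        return float("nan"), pd.NaT
    peak_row = in_window.loc[in_window[value_col].idxmax()]
    return peak_row[value_col], peak_row["timestamp"]


# Toy readings: user 'a' peaks at 09:30; the 12:00 reading is outside the window
levels = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2021-02-15 08:15", "2021-02-15 08:45", "2021-02-15 09:30",
        "2021-02-15 10:00", "2021-02-15 12:00", "2021-02-15 09:00",
    ]),
    "glucose": [80.0, 95.0, 120.0, 110.0, 130.0, 200.0],
})

meal_start = pd.Timestamp("2021-02-15 08:30")
peak_value, peak_time = peak_after_meal(levels, "a", meal_start)
minutes_until_peak = (peak_time - meal_start).total_seconds() / 60
```

Returning the value and the timestamp together avoids filtering the same window twice per meal.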

In [389]:
#Applying functions to get glucose levels (each user at a time)
df_activities_a = df_activities_clean[df_activities_clean['user_id']==create_customer('a')].copy()
df_activities_a['glucose_start'] = df_activities_a['timestamp_start'].apply(get_glucose_before_meal_a)
df_activities_a['glucose_after'] = df_activities_a['timestamp_aftermeal'].apply(get_glucose_after_meal_a)
df_activities_a['peak time'] = df_activities_a['timestamp_aftermeal'].apply(get_glucose_after_meal_spike_time_a)
#Minutes from meal start to peak, as a float
df_activities_a['minutes until peak'] = (df_activities_a['peak time']-df_activities_a['timestamp_start']).dt.total_seconds()/60
df_activities_a.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa00,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_BREAKFAST,"40 g Haferflocken, 230 g Joghurt 0,3 %, 90 g B...",2021-02-15 08:30:00,2021-02-15 11:30:00,80.0,109.0,2021-02-15 08:55:00,25.0
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa01,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_LUNCH,"98 g Möhren-Walnuss-VK-Brot, 87 g Gurke, 55 g ...",2021-02-15 12:45:00,2021-02-15 15:45:00,83.0,121.0,2021-02-15 13:44:00,59.0
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa02,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_SNACK,"Mandarine, Teelöffel Erdnussmuß",2021-02-15 16:15:00,2021-02-15 19:15:00,73.0,102.0,2021-02-15 16:57:00,42.0
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa04,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_DINNER,"50 g BasmatiVollkorn Reis, 20g Currypaste, 15 ...",2021-02-15 19:30:00,2021-02-15 22:30:00,73.0,101.0,2021-02-15 21:21:00,111.0
5,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa05,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,MEAL_BREAKFAST,"230 g Joghurt, 40g Haferflocken, 65 g Apfel",2021-02-17 08:15:00,2021-02-17 11:15:00,78.0,108.0,2021-02-17 08:51:00,36.0


In [390]:
#Applying functions to get glucose levels (each user at a time)
df_activities_b = df_activities_clean[df_activities_clean['user_id']==create_customer('b')].copy()
df_activities_b['glucose_start'] = df_activities_b['timestamp_start'].apply(get_glucose_before_meal_b)
df_activities_b['glucose_after'] = df_activities_b['timestamp_aftermeal'].apply(get_glucose_after_meal_b)
df_activities_b['peak time'] = df_activities_b['timestamp_aftermeal'].apply(get_glucose_after_meal_spike_time_b)
#Minutes from meal start to peak, as a float
df_activities_b['minutes until peak'] = (df_activities_b['peak time']-df_activities_b['timestamp_start']).dt.total_seconds()/60
df_activities_b.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak
0,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbb00,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,MEAL_BREAKFAST,2 Pott Kaffe +Zucker,2021-02-19 06:30:00,2021-02-19 09:30:00,71.0,109.0,2021-02-19 08:14:00,104.0
1,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbb01,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,MEAL_BREAKFAST,Gemüsesuppe instant,2021-02-19 07:30:00,2021-02-19 10:30:00,80.0,109.0,2021-02-19 08:14:00,44.0
2,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbb02,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,MEAL_SNACK,Gemüsesuppe instant + Grüner Tee ohne alles,2021-02-19 09:00:00,2021-02-19 12:00:00,87.0,109.0,2021-02-19 09:29:00,29.0
4,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbb04,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,MEAL_LUNCH,"Kartoffeln (150 g), Quark 250 g, Bohnen 250g ,...",2021-02-19 12:45:00,2021-02-19 15:45:00,102.0,111.0,2021-02-19 15:30:00,165.0
5,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbb05,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,MEAL_SNACK,"kaffee und etwas zucker, scheibe Brot mit L�tt...",2021-02-19 15:30:00,2021-02-19 18:30:00,86.0,92.0,2021-02-19 16:15:00,45.0


In [391]:
#Applying functions to get glucose levels (each user at a time)
df_activities_c = df_activities_clean[df_activities_clean['user_id']==create_customer('c')].copy()
df_activities_c['glucose_start'] = df_activities_c['timestamp_start'].apply(get_glucose_before_meal_c)
df_activities_c['glucose_after'] = df_activities_c['timestamp_aftermeal'].apply(get_glucose_after_meal_c)
df_activities_c['peak time'] = df_activities_c['timestamp_aftermeal'].apply(get_glucose_after_meal_spike_time_c)
#Minutes from meal start to peak, as a float
df_activities_c['minutes until peak'] = (df_activities_c['peak time']-df_activities_c['timestamp_start']).dt.total_seconds()/60
df_activities_c.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak
0,cccccccc-cccc-cccc-cccc-cccccccccc00,cccccccc-cccc-cccc-cccc-cccccccccccc,MEAL_BREAKFAST,"3 Vollkorntoast, 75g Schmelzkäse, 35g Butter, ...",2021-02-17 06:00:00,2021-02-17 09:00:00,69.0,108.0,2021-02-17 07:16:00,76.0
1,cccccccc-cccc-cccc-cccc-cccccccccc01,cccccccc-cccc-cccc-cccc-cccccccccccc,MEAL_LUNCH,"350g Spinat,250g Kartoffeln, 2 Spiegeleier, 40...",2021-02-17 13:45:00,2021-02-17 16:45:00,114.0,108.0,2021-02-17 16:04:00,139.0
2,cccccccc-cccc-cccc-cccc-cccccccccc02,cccccccc-cccc-cccc-cccc-cccccccccccc,MEAL_DINNER,"1 Vollkornbrot, 25g Butter, 65g Romadur, 20g E...",2021-02-17 19:45:00,2021-02-17 22:45:00,90.0,117.0,2021-02-17 21:36:00,111.0
3,cccccccc-cccc-cccc-cccc-cccccccccc03,cccccccc-cccc-cccc-cccc-cccccccccccc,MEAL_LUNCH,"140g gebackener Leberkäse, 150g Erbsen-Möhren,...",2021-02-19 11:45:00,2021-02-19 14:45:00,76.0,111.0,2021-02-19 12:37:00,52.0
4,cccccccc-cccc-cccc-cccc-cccccccccc04,cccccccc-cccc-cccc-cccc-cccccccccccc,MEAL_DINNER,"1 Vollkornbrötchen, 70g Leberkäse, 100g Mixed ...",2021-02-19 18:30:00,2021-02-19 21:30:00,81.0,125.0,2021-02-19 19:41:00,71.0


In [393]:
#Combining the customers back into a single dataframe
df_activities_ready = pd.concat([df_activities_a, df_activities_b, df_activities_c])

#Adding column for difference in glucose between start and after meal
df_activities_ready['change_aftermeal'] = df_activities_ready['glucose_after'] - df_activities_ready['glucose_start']

#Adding column for pct difference
df_activities_ready['pct_change_aftermeal'] = round((df_activities_ready['glucose_after'] - df_activities_ready['glucose_start'])/df_activities_ready['glucose_start'],4)*100

#Adding column for number of ingredients in meal
df_activities_ready['number of components'] = df_activities_ready['description'].apply(lambda x: len(x.split(',')))

#Cleaning type of meal
df_activities_ready['record_type'] = df_activities_ready['record_type'].apply(lambda x: x.split('_')[-1])


df_activities_ready.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak,change_aftermeal,pct_change_aftermeal,number of components
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa00,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,BREAKFAST,"40 g Haferflocken, 230 g Joghurt 0,3 %, 90 g B...",2021-02-15 08:30:00,2021-02-15 11:30:00,80.0,109.0,2021-02-15 08:55:00,25.0,29.0,36.25,5
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa01,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,LUNCH,"98 g Möhren-Walnuss-VK-Brot, 87 g Gurke, 55 g ...",2021-02-15 12:45:00,2021-02-15 15:45:00,83.0,121.0,2021-02-15 13:44:00,59.0,38.0,45.78,4
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa02,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,SNACK,"Mandarine, Teelöffel Erdnussmuß",2021-02-15 16:15:00,2021-02-15 19:15:00,73.0,102.0,2021-02-15 16:57:00,42.0,29.0,39.73,2
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa04,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,DINNER,"50 g BasmatiVollkorn Reis, 20g Currypaste, 15 ...",2021-02-15 19:30:00,2021-02-15 22:30:00,73.0,101.0,2021-02-15 21:21:00,111.0,28.0,38.36,12
5,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa05,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,BREAKFAST,"230 g Joghurt, 40g Haferflocken, 65 g Apfel",2021-02-17 08:15:00,2021-02-17 11:15:00,78.0,108.0,2021-02-17 08:51:00,36.0,30.0,38.46,3


Creating a dataframe that contains one line per meal component

In [343]:
#Trying to get the quantity of each component
def get_quantity(x):
    quantity = '0'
    for n in range(len(x)):
        if str(x[n]) in ['0','1','2','3','4','5','6','7','8', '9']:
            quantity+=x[n]
    return float(quantity)
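
Note that `get_quantity` concatenates every digit it finds, and that splitting on every comma also splits German decimal commas, so a description such as "230 g Joghurt 0,3 %" would be cut into two components and the first would yield 2300. A minimal sketch of a stricter parser follows. The helper names, the regular expressions and the restriction to the units `g` and `ml` are assumptions for illustration, not part of the original analysis.

```python
import re

# Split on commas unless the comma sits between two digits (decimal comma, e.g. "0,3 %")
COMPONENT_SEP = re.compile(r"(?<!\d),|,(?!\d)")
# First number in a component, with an optional decimal part and an optional unit
QUANTITY = re.compile(r"(\d+(?:[.,]\d+)?)\s*(g|ml)?\b", re.IGNORECASE)


def split_components(description):
    """Return the non-empty, stripped components of a meal description."""
    return [part.strip() for part in COMPONENT_SEP.split(description) if part.strip()]


def parse_quantity(component):
    """Return (amount, unit) for the first number found, or (None, None)."""
    match = QUANTITY.search(component)
    if match is None:
        return None, None
    amount = float(match.group(1).replace(",", "."))
    unit = match.group(2).lower() if match.group(2) else None
    return amount, unit


components = split_components("40 g Haferflocken, 230 g Joghurt 0,3 %, 65 g Apfel")
quantities = [parse_quantity(component) for component in components]
```

Keeping the unit next to the amount makes it possible to tell "1 Schb." (one slice) apart from gram quantities before comparing components.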
            

In [402]:
df_activities_meal_components = df_activities_ready.copy()

#Separating components based on commas and exploding to create new rows
df_activities_meal_components['meal component'] = df_activities_meal_components['description'].apply(lambda x: x.split(','))
df_activities_meal_components = df_activities_meal_components.explode('meal component')

#Getting the quantity of each component
df_activities_meal_components['Quantity'] = df_activities_meal_components['meal component'].apply(get_quantity)

#Getting only the name by splitting based on the spaces
df_activities_meal_components['meal component'] = df_activities_meal_components['meal component'].apply(lambda x: x.split(' ')[-1])

df_activities_meal_components.tail()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak,change_aftermeal,pct_change_aftermeal,number of components,day,meal component,Quantity
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Vollkorndinkelbrot,1.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Butter,30.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Kartoffelsalat,100.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Tomatensalat,100.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Rotwein,400.0


### Note: Advanced text mining is not my strong suit, so if it is very important in the short term, I might not be the person you are looking for! I can learn, though; I am just no expert at the moment :)

# Visualizing data

#### Important note
All the observations made here are purely exploratory and are only intended to demonstrate how the visualisations could help in forming hypotheses that would have to be verified with further investigation.

In [403]:
df_levels_clean.head()

Unnamed: 0,user_id,timestamp,Aufzeichnungstyp,Glukosewert-Verlauf mg/dL
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 10:57:00,0,77.0
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:12:00,0,78.0
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:27:00,0,78.0
3,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:42:00,0,76.0
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,2021-02-18 11:57:00,0,75.0


In [404]:
df_activities_ready.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak,change_aftermeal,pct_change_aftermeal,number of components,day
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa00,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,BREAKFAST,"40 g Haferflocken, 230 g Joghurt 0,3 %, 90 g B...",2021-02-15 08:30:00,2021-02-15 11:30:00,80.0,109.0,2021-02-15 08:55:00,25.0,29.0,36.25,5,15
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa01,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,LUNCH,"98 g Möhren-Walnuss-VK-Brot, 87 g Gurke, 55 g ...",2021-02-15 12:45:00,2021-02-15 15:45:00,83.0,121.0,2021-02-15 13:44:00,59.0,38.0,45.78,4,15
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa02,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,SNACK,"Mandarine, Teelöffel Erdnussmuß",2021-02-15 16:15:00,2021-02-15 19:15:00,73.0,102.0,2021-02-15 16:57:00,42.0,29.0,39.73,2,15
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa04,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,DINNER,"50 g BasmatiVollkorn Reis, 20g Currypaste, 15 ...",2021-02-15 19:30:00,2021-02-15 22:30:00,73.0,101.0,2021-02-15 21:21:00,111.0,28.0,38.36,12,15
5,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa05,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,BREAKFAST,"230 g Joghurt, 40g Haferflocken, 65 g Apfel",2021-02-17 08:15:00,2021-02-17 11:15:00,78.0,108.0,2021-02-17 08:51:00,36.0,30.0,38.46,3,17


In [406]:
df_activities_meal_components.tail()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak,change_aftermeal,pct_change_aftermeal,number of components,day,meal component,Quantity
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Vollkorndinkelbrot,1.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Butter,30.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Kartoffelsalat,100.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Tomatensalat,100.0
6,cccccccc-cccc-cccc-cccc-cccccccccc06,cccccccc-cccc-cccc-cccc-cccccccccccc,DINNER,"1 Schb. Vollkorndinkelbrot, 30g Butter, 100g K...",2021-02-21 18:00:00,2021-02-21 21:00:00,90.0,104.0,2021-02-21 19:13:00,73.0,14.0,15.56,5,21,Rotwein,400.0


#### Evolution of glucose for one user (a)

In [348]:
df = df_levels_clean[df_levels_clean['user_id']==create_customer('a')].sort_values(by='timestamp')
fig = px.line(df, x='timestamp', y="Glukosewert-Verlauf mg/dL")
fig.show()

In [349]:
df = df_levels_clean[df_levels_clean['user_id']==create_customer('a')].sort_values(by='timestamp')
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df['glucose'] = df['Glukosewert-Verlauf mg/dL']

fig = px.line(df, x='hour', y="glucose", 
             facet_col="day", facet_col_wrap=2,
             title="Hourly evolution of glucose levels per day")
fig.show()

- Glucose levels ranged between 53 and 162 mg/dL for user 'a'.
- Although we can see a general trend of 3 spikes a day, which should correspond to the periods after the 3 main meals, glucose variability also differs across days. For example, on the 20th and 22nd the night peak seems less steep than on the 16th or 18th. It would be interesting to explore what changed in terms of meals on those dates.
- Since the days for which we have meal data (15, 17 and 19) seem to show a similar pattern in glucose variability, we would expect the meal composition and schedule to be similar on those days as well.
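
The range and the day-to-day differences described above can also be checked numerically rather than read off the plots. The following is a minimal sketch, assuming a frame with `timestamp` and `glucose` columns like `df` in the cell above; the readings are synthetic and only illustrate the aggregation.

```python
import pandas as pd

# Synthetic readings for illustration only; not taken from levels_all.csv.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-02-15 08:00", "2021-02-15 13:00", "2021-02-15 20:00",
        "2021-02-16 08:00", "2021-02-16 13:00", "2021-02-16 20:00",
    ]),
    "glucose": [80, 121, 101, 75, 110, 140],
})

# Group by calendar date and summarise the spread of each day.
daily = (
    readings
    .groupby(readings["timestamp"].dt.date)["glucose"]
    .agg(["min", "max", "std"])
)
daily["range"] = daily["max"] - daily["min"]
print(daily)
```

Days with a noticeably larger `range` or `std` would be the first candidates to compare against the tracked meals.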

#### Are there differences in the impact of different meal types on glucose levels?

In [350]:
df = pd.melt(df_activities_ready[df_activities_ready['user_id']==create_customer('a')], 
             id_vars=['user_id', 'record_type', 'timestamp_start'], value_vars=['glucose_start', 'glucose_after'])

df['Glucose level'] = df['value']
df['Meal type'] = df['record_type']
df['Type of measurement'] = df['variable'].apply(lambda x: 'Before the meal' if x=='glucose_start' else 'Max in 3h after meal')

fig = px.box(df, x="Meal type", y="Glucose level", color="Type of measurement",
             title="Distribution of glucose levels before and after meal by meal type",
            )
fig.show()

- Glucose levels both before and after a meal seem to be more consistent in the morning.
- After-dinner glucose levels seem to show the greatest variability in the period analyzed.
- Snacks seem to have the least impact on glucose levels for this user.
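
A small table of summary statistics per meal type can back up what the box plot suggests. This sketch uses only the five rows printed by `df_activities_ready.head()` in this notebook, so it is an illustration of the aggregation and far too small a sample to confirm or contradict the observations above.

```python
import pandas as pd

# Values copied from the five rows of df_activities_ready.head() shown in this notebook.
meals = pd.DataFrame({
    "record_type": ["BREAKFAST", "LUNCH", "SNACK", "DINNER", "BREAKFAST"],
    "glucose_start": [80.0, 83.0, 73.0, 73.0, 78.0],
    "glucose_after": [109.0, 121.0, 102.0, 101.0, 108.0],
})
meals["change_aftermeal"] = meals["glucose_after"] - meals["glucose_start"]

# Count, median and extremes of the post-meal change for each meal type.
summary = (
    meals.groupby("record_type")["change_aftermeal"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Applied to the full `df_activities_ready` frame for one user, the same `groupby` would give the numbers behind each box in the plot.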

#### Is there variability in the impact on glucose levels of the same meal type for the same individual?

In [396]:
df_activities_ready['day'] = df_activities_ready['timestamp_start'].dt.day
df_activities_ready.head()

Unnamed: 0,id,user_id,record_type,description,timestamp_start,timestamp_aftermeal,glucose_start,glucose_after,peak time,minutes until peak,change_aftermeal,pct_change_aftermeal,number of components,day
0,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa00,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,BREAKFAST,"40 g Haferflocken, 230 g Joghurt 0,3 %, 90 g B...",2021-02-15 08:30:00,2021-02-15 11:30:00,80.0,109.0,2021-02-15 08:55:00,25.0,29.0,36.25,5,15
1,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa01,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,LUNCH,"98 g Möhren-Walnuss-VK-Brot, 87 g Gurke, 55 g ...",2021-02-15 12:45:00,2021-02-15 15:45:00,83.0,121.0,2021-02-15 13:44:00,59.0,38.0,45.78,4,15
2,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa02,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,SNACK,"Mandarine, Teelöffel Erdnussmuß",2021-02-15 16:15:00,2021-02-15 19:15:00,73.0,102.0,2021-02-15 16:57:00,42.0,29.0,39.73,2,15
4,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa04,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,DINNER,"50 g BasmatiVollkorn Reis, 20g Currypaste, 15 ...",2021-02-15 19:30:00,2021-02-15 22:30:00,73.0,101.0,2021-02-15 21:21:00,111.0,28.0,38.36,12,15
5,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaa05,aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,BREAKFAST,"230 g Joghurt, 40g Haferflocken, 65 g Apfel",2021-02-17 08:15:00,2021-02-17 11:15:00,78.0,108.0,2021-02-17 08:51:00,36.0,30.0,38.46,3,17
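
The derived columns in this table (`glucose_start`, `glucose_after`, `peak time`, `minutes until peak`, `change_aftermeal`, `pct_change_aftermeal`) describe the 3-hour window after each meal. The sketch below shows one way such values could be computed. `post_meal_response` is a hypothetical helper, not the function used in this notebook, and it assumes that the first reading in the window stands in for the pre-meal level and that `glucose_after` is the maximum in the window. The readings are synthetic, chosen to resemble the first row.

```python
import pandas as pd

def post_meal_response(levels, meal_start, window_hours=3):
    """Summarise glucose readings in the window after a meal (hypothetical helper)."""
    meal_start = pd.Timestamp(meal_start)
    meal_end = meal_start + pd.Timedelta(hours=window_hours)
    in_window = levels["timestamp"].between(meal_start, meal_end)
    window = levels.loc[in_window].sort_values("timestamp")
    if window.empty:
        return None  # no sensor data for this meal

    # Assumption: the first reading in the window stands in for the pre-meal level.
    glucose_start = float(window["glucose"].iloc[0])
    # Assumption: the post-meal value is the maximum reading in the window.
    peak_row = window.loc[window["glucose"].idxmax()]
    glucose_after = float(peak_row["glucose"])
    change = glucose_after - glucose_start
    return {
        "glucose_start": glucose_start,
        "glucose_after": glucose_after,
        "peak time": peak_row["timestamp"],
        "minutes until peak": (peak_row["timestamp"] - meal_start).total_seconds() / 60,
        "change_aftermeal": change,
        "pct_change_aftermeal": round(100 * change / glucose_start, 2),
    }

# Synthetic readings for illustration only; the last one falls outside the window.
levels = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-02-15 08:30", "2021-02-15 08:45", "2021-02-15 08:55",
        "2021-02-15 09:30", "2021-02-15 11:30", "2021-02-15 12:00",
    ]),
    "glucose": [80, 100, 109, 95, 85, 130],
})
result = post_meal_response(levels, "2021-02-15 08:30")
print(result)
```

The percentage change is the absolute change divided by the starting level, which is consistent with the rows above (for example 29 / 80 = 36.25 %).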


In [400]:
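# Work on a copy so that adding columns does not trigger SettingWithCopyWarning
df = df_activities_ready[df_activities_ready['user_id']==create_customer('a')].copy()
# Use the percentage column so that the plotted data matches the axis label
df['Percentage change in glucose after the meal'] = df['pct_change_aftermeal']
df['Type of meal'] = df['record_type']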

fig = px.box(df, x="Type of meal", y="Percentage change in glucose after the meal", color="day",
             title="Variability of glucose levels before and after meal by meal type",
            )
fig.show()

- The effect of breakfast on glucose levels seems to be most stable for the 3 days of data available, followed by lunch.
- For snacks and dinner, the different meals seem to have had a very different impact across the available days.
- Possibly the snack on the 19th and dinner of the 15th are better options than the others.

In [399]:
# Work on a copy so that adding columns does not trigger SettingWithCopyWarning
df = df_activities_ready[df_activities_ready['user_id']==create_customer('a')].copy()
df['Type of meal'] = df['record_type']

fig = px.box(df, x="Type of meal", y="minutes until peak", color="day",
             title="Variability of time until glucose peak (minutes) by meal type",
            )
fig.show()

# Clustering

If I had a dataset containing the user characteristics, information about each meal, the composition of the meals (ideally with the quantities and nutrients of each product consumed), the respective datetime information (e.g. time of day, day of the week) and variables describing the glucose response (e.g. difference before and after, time to peak), I could use clustering for tasks such as:
- Identifying groups of foods or nutrients that show similar patterns in how they relate to glucose response
- Identifying groups of individuals (or characteristics) that show similar glucose responses
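
As an illustration of the clustering idea, the sketch below groups meals by two glucose-response features. k-means is used here only as one possible method, not as the method chosen in this notebook. The `kmeans` function is a small hypothetical implementation with a deterministic start, and the feature values are synthetic.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic features per meal: [pct_change_aftermeal, minutes until peak].
features = np.array([
    [40.0, 30.0], [45.0, 25.0], [38.0, 35.0],    # fast, strong responses
    [15.0, 100.0], [12.0, 110.0], [18.0, 95.0],  # slow, mild responses
])

# Standardise so that both features contribute on a comparable scale.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
labels, centroids = kmeans(scaled, k=2)
print(labels)
```

With real data, the number of clusters and the choice of features would need to be validated, for example by comparing cluster assignments with meal descriptions.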


# Prediction

With a similar dataset one could model the relationship between meal composition and a target such as glucose response. This would make it possible to study:
- Whether future responses to meals can be predicted from their nutritional composition
- How a person's characteristics relate to their glucose response profile for certain meals or nutrients
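
A first version of such a model could be a simple linear regression. The sketch below fits one with ordinary least squares, assuming a hypothetical `carbs_g` feature (grams of carbohydrate per meal) that is not present in the tables shown here. All numbers are synthetic and were constructed to lie exactly on a line.

```python
import numpy as np

# Synthetic data for illustration only: carbohydrate content per meal (hypothetical
# feature) and the glucose change after the meal in mg/dL.
carbs_g = np.array([10.0, 20.0, 30.0, 40.0])
change_aftermeal = np.array([13.0, 21.0, 29.0, 37.0])

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones_like(carbs_g), carbs_g])
coef, *_ = np.linalg.lstsq(X, change_aftermeal, rcond=None)
intercept, slope = coef

# Predicted glucose change for a hypothetical meal with 25 g of carbohydrate.
predicted = intercept + slope * 25.0
print(intercept, slope, predicted)
```

Real responses would be noisier and depend on more than one nutrient, so a held-out evaluation would be needed before trusting any such prediction.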