# Analysis of Diabetes Log Records

<i>By Diego Ramallo</i>

<img src="glucose_time_grid_header.png" alt="Drawing" style="width: 1000px;"/>

## Purpose

I found an interesting diabetes dataset in the [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/Diabetes) that contains time series data for 70 patients.

I'd like to explore it thoroughly, but first I'll begin by looking at the data for only one patient.

[Formatting Data](#Formatting Data)

[Visualizing data](#Visualizing Data)

[Data Consolidation](#Data Consolidation)

[Constructing Glucose Time Grid](#Constructing Glucose Time Grid)

## Formatting Data

<a id='Formatting Data'></a>

Interesting. I thought the data was just blood glucose values recorded at different times, alongside different events. After reading more closely, though, that doesn't seem to be the case. Although there are a lot of reasonable values in the 'Value' field (e.g. between 80 and 250), there are also a lot of values between 0 and 10, which are not realistic for blood glucose. The documentation isn't too useful here. It just says that each record has a code for the activity (and several activities can share the same datetime).

I think that the 'Value' column isn't just blood glucose; more likely it holds the value associated with the [code activity](https://archive.ics.uci.edu/ml/datasets/Diabetes). If that's the case, then I can just select the codes that I know are glucose measurements in order to get time-lapse BG measurements :)
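One quick way to sanity-check this idea is to look at the range of values per code. Here's a minimal sketch using only the first five rows of data-01 shown in this notebook; the names `sample`, `glucose_like`, and `summary` are made up for this illustration.

```python
import pandas as pd

# Five rows copied from the head of data-01 shown in this notebook
sample = pd.DataFrame({
    'code':  [58, 33, 34, 62, 33],
    'value': [100, 9, 13, 119, 7],
})

# Two of the glucose codes from the key (58 = pre-breakfast, 62 = pre-supper)
glucose_like = [58, 62]

# Value range per code: glucose codes should sit well above insulin-dose codes
summary = sample.groupby('code')['value'].agg(['min', 'max', 'count'])
summary['is_glucose'] = summary.index.isin(glucose_like)
print(summary)
```

On these few rows the insulin codes (33, 34) stay in the single or low double digits while the glucose codes are at 100 or above, which is consistent with the hypothesis, though five rows obviously don't prove it.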

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')
%matplotlib nbagg

In [10]:
#Here's the key of codes:

codes= {
    33 : 'Regular insulin dose', 34 : 'NPH insulin dose', 
    35 : 'UltraLente insulin dose' ,48 : 'Unspecified blood glucose measurement', 
    57 : 'Unspecified blood glucose measurement', 58: 'Pre-breakfast blood glucose measurement', 
    59 : 'Post-breakfast blood glucose measurement', 60 : 'Pre-lunch blood glucose measurement', 
    61 : 'Post-lunch blood glucose measurement', 62 : 'Pre-supper blood glucose measurement', 
    63 : 'Post-supper blood glucose measurement', 64 : 'Pre-snack blood glucose measurement', 
    65 : 'Hypoglycemic symptoms', 66 : 'Typical meal ingestion', 
    67 : 'More-than-usual meal ingestion', 68 : 'Less-than-usual meal ingestion', 
    69 : 'Typical exercise activity', 70 : 'More-than-usual exercise activity', 
    71 : 'Less-than-usual exercise activity', 72 : 'Unspecified special event'
        }

glucose_codes= [48, 57, 58, 59, 60, 61, 62, 63, 64]
glucose_titles = [codes[48], codes[57], codes[58], codes[59], codes[60],
               codes[61], codes[62], codes[63], codes[64]]

In [11]:
data01= pd.read_csv('Diabetes-Data/data-01', sep= '\t', header = None)
data01.columns= ['date','time','code','bg_value']
data01.head()

Unnamed: 0,date,time,code,bg_value
0,04-21-1991,9:09,58,100
1,04-21-1991,9:09,33,9
2,04-21-1991,9:09,34,13
3,04-21-1991,17:08,62,119
4,04-21-1991,17:08,33,7


In [12]:
data01['datetime']= pd.to_datetime(data01['date']+' '+data01['time'])
data01= data01.drop(['date','time'], axis= 1)
data01.head()

Unnamed: 0,code,bg_value,datetime
0,58,100,1991-04-21 09:09:00
1,33,9,1991-04-21 09:09:00
2,34,13,1991-04-21 09:09:00
3,62,119,1991-04-21 17:08:00
4,33,7,1991-04-21 17:08:00


In [13]:
#Now let's filter the data so we only see glucose measurement data
glucoseOnly= data01[data01['code'].isin(glucose_codes)]


print('These are the unique types of glucose measurements that are in our dataset: {}'.format(glucoseOnly['code'].unique()))

glucoseOnly.head(5)

These are the unique types of glucose measurements that are in our dataset: [58 62 48 60]


Unnamed: 0,code,bg_value,datetime
0,58,100,1991-04-21 09:09:00
3,62,119,1991-04-21 17:08:00
5,48,123,1991-04-21 22:51:00
6,58,216,1991-04-22 07:35:00
10,62,211,1991-04-22 16:56:00


Excellent, now we know we only have codes 58, 62, 48, and 60 in our glucoseOnly table representing 'Pre-breakfast blood glucose measurement', 'Pre-supper blood glucose measurement', 'Unspecified blood glucose measurement', and 'Pre-lunch blood glucose measurement'.

Although there are other relevant codes, we can explore those later and at least visualize the data in the meantime.

## Visualizing Data

<a id='Visualizing Data'></a>

In [358]:
#Neat, now let's visualize the data over time!
figure, ax = plt.subplots(1,1, figsize=(10,6))
plt.plot(glucoseOnly['datetime'], glucoseOnly['bg_value'], '.r')
plt.plot(glucoseOnly['datetime'], glucoseOnly['bg_value'], color= 'orange')
plt.xlabel('Time', fontsize= 18, fontweight= 'bold')
plt.ylabel('BG Value', fontsize= 18, fontweight= 'bold')
plt.title('Blood Glucose Values for Subject 01', fontsize= 20, fontweight= 'bold')
ax.set_facecolor('#333435')

In [15]:
#One other thing I'd like to do is to visualize the rolling mean (moving average) of my data
#let's calculate that

bgMovingMean3= glucoseOnly['bg_value'].rolling(3, center=True).mean()
print('{} {}'.format(len(bgMovingMean3), len(glucoseOnly)))

369 369
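To make the centered window concrete, here's a small sketch on the first five glucose readings for patient 01 (copied from the table above). The names `bg` and `smoothed` are just for this example.

```python
import pandas as pd

# First five glucose readings for patient 01, copied from the table above
bg = pd.Series([100, 119, 123, 216, 211])

# Window of 3, centered: each value is averaged with its two neighbours
smoothed = bg.rolling(3, center=True).mean()
print(smoothed.round(1).tolist())
```

The first and last entries come out as NaN because they don't have a neighbour on both sides, which is why the rolling series keeps the same length as the input.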


In [363]:
#Now let's plot the rolling average on top of these measurements
figure, ax = plt.subplots(1,1, figsize=(10,6))
plt.plot(glucoseOnly['bg_value'], '.r')
plt.plot(glucoseOnly['bg_value'], color= 'orange')
plt.plot(glucoseOnly['bg_value'].rolling(3, center=True).mean(), color= '#40c4ed')

#Let's also plot reference lines for typical thresholds (below healthy range= 80, and above range= 250)
ax.axhline(250, color= 'k', alpha= 0.3)
ax.axhline(80, color= 'k', alpha= 0.3)

plt.xlabel('Time', fontsize= 18, fontweight= 'bold')
plt.ylabel('BG Value', fontsize= 18, fontweight= 'bold')
plt.title('Blood Glucose Values for Subject 01', fontsize= 20, fontweight= 'bold')
ax.set_facecolor('#333435')

For my next figure, I think it'd be neat to have an interactive plot of the raw data where I can hover over points and identify what the code was for that measurement! :)

In [17]:
%matplotlib inline

In [367]:
#Extra interactive Bokeh Figure

from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models import HoverTool
from bokeh.models import NumeralTickFormatter
from bokeh.models import DatetimeTickFormatter


output_notebook()

#Build the data source for the hover tool: each point carries its
#datetime (x), its glucose value (y), and the description of its code (title)

source0= ColumnDataSource(data= dict(
        x= glucoseOnly['datetime'], y=  glucoseOnly['bg_value'],
        title= glucoseOnly['code'].apply(lambda x: codes[x])))

#The hover tool shows the @title field of whichever point is under the cursor

hover= HoverTool(tooltips= [("Activity", " @title")])

#Initialize figure and define attributes
p = figure(plot_width=900, plot_height=500, tools=[hover, 'wheel_zoom', 'pan', 'reset'], 
           title= "Blood Glucose Levels Over Time for Patient 01", title_text_font_size='18pt')

#Connect the measurements with a line
p.line('x', 'y', line_color= '#f2a02e', source=source0)

#Draw each measurement as a point; these also come from source0
#so that the hover tool can look up their titles
p.circle('x', 'y', fill_color= 'red', line_color= 'red', source=source0)

#Now let's add the reference lines
p.line(glucoseOnly['datetime'], [250]*len(glucoseOnly['bg_value']), line_dash= 'dashed', line_color= 'black')
p.line(glucoseOnly['datetime'], [80]*len(glucoseOnly['bg_value']), line_dash= 'dashed', line_color= 'black')

#And finally the moving average
p.line(glucoseOnly['datetime'], bgMovingMean3, line_color= '#410be2')


p.xaxis[0].formatter = NumeralTickFormatter(format="0")
p.yaxis[0].formatter = NumeralTickFormatter(format="0")

p.xaxis.formatter=DatetimeTickFormatter(formats=dict(
        hours=["%d %B %Y"],
        days=["%d %B %Y"],
        months=["%d %B %Y"],
        years=["%d %B %Y"],
    ))
p.xaxis.major_label_orientation = np.pi/4

p.xaxis.axis_label = "Date"
p.yaxis.axis_label = "Blood Glucose Level"
p.xaxis.axis_label_text_font_size = "12pt"
p.yaxis.axis_label_text_font_size = "12pt"

p.title_text_font_style = "bold"

show(p)

## Data Consolidation

<a id='Data Consolidation'></a>

Now that I more or less understand a single user's data, I think it could be useful to put the data for ALL users into the same table. We could add the file suffix as a user_id column and potentially one-hot encode the codes to turn them into features. Let's first look at the files at our disposal. I know our files are in a subfolder called 'Diabetes-Data'.
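As a side note before listing the files, here's a rough sketch of what one-hot encoding the codes could look like, assuming `pd.get_dummies` is a good fit. The rows are the first five from data-01, and the `sample` and `encoded` names are made up for this example.

```python
import pandas as pd

# First five rows of data-01 with a uid column added for illustration
sample = pd.DataFrame({
    'uid': [1, 1, 1, 1, 1],
    'code': [58, 33, 34, 62, 33],
    'bg_value': [100, 9, 13, 119, 7],
})

# Each distinct code becomes its own indicator column (code_33, code_34, ...)
encoded = pd.get_dummies(sample, columns=['code'], prefix='code')
print(encoded.columns.tolist())
```

Each row ends up with exactly one indicator set, so the code stops being treated as a number and becomes a set of features instead.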

In [19]:
import os 

subDir= os.listdir(os.getcwd() +'/Diabetes-Data')
print(subDir)

['data-01', 'data-02', 'data-03', 'data-04', 'data-05', 'data-06', 'data-07', 'data-08', 'data-09', 'data-10', 'data-11', 'data-12', 'data-13', 'data-14', 'data-15', 'data-16', 'data-17', 'data-18', 'data-19', 'data-20', 'data-21', 'data-22', 'data-23', 'data-24', 'data-25', 'data-26', 'data-27', 'data-28', 'data-29', 'data-30', 'data-31', 'data-32', 'data-33', 'data-34', 'data-35', 'data-36', 'data-37', 'data-38', 'data-39', 'data-40', 'data-41', 'data-42', 'data-43', 'data-44', 'data-45', 'data-46', 'data-47', 'data-48', 'data-49', 'data-50', 'data-51', 'data-52', 'data-53', 'data-54', 'data-55', 'data-56', 'data-57', 'data-58', 'data-59', 'data-60', 'data-61', 'data-62', 'data-63', 'data-64', 'data-65', 'data-66', 'data-67', 'data-68', 'data-69', 'data-70', 'Data-Codes', 'Domain-Description', 'README-DIABETES']


Ok, so we know that there are 70 data files. Our goal is to append the content of each one to a new dataframe that will contain the total data. The workflow for each file is:

1) Read the file into a new dataframe.

2) Assign a uid that matches the file's position in our iteration to all rows that belong to this file.

3) Concatenate the new dataframe onto the total dataframe.

Since our files of interest are the first 70 items in our list, we'll just loop over those indices instead of using a regex to match the file names :)

In [72]:
#Initialize totalData dataframe

totalData= pd.DataFrame(columns= ['date','time','code','bg_value','uid'])

In [75]:
#Iterate through our directory list and concatenate rows from
#different users to totalData dataframe.
#This cell was run once per index range: range(0,19), range(20,66), range(67,70)

for i in range(67,70):
    
    tempData= pd.read_csv('Diabetes-Data/' + subDir[i], sep= '\t', header = None)
    tempData.columns= ['date','time','code','bg_value']
    
    tempData['datetime']= pd.to_datetime(tempData['date']+' '+tempData['time'])
    tempData= tempData.drop(['date','time'], axis= 1)
    tempUid= [i+1]*len(tempData)
    tempData['uid']= tempUid
    totalData= pd.concat([totalData,tempData])

I got errors reading the files at list indices 19 and 66 (data-20 and data-67). Basically, the errors said that the day number didn't exist in that particular month. So for now I just skipped them by running the loop over three index ranges: (0,19), (20,66), and (67,70). I'll also drop the 'date' and 'time' columns, since I already added the 'datetime' column.
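Instead of skipping whole files, another option might be to let pandas mark the unparseable rows and drop only those. This is just a sketch under assumptions: the third row below has a made-up invalid date (June 31st) to mimic the kind of error I saw, and the names `raw`, `df`, `n_bad`, and `clean` are hypothetical.

```python
import io
import pandas as pd

# Two real rows from data-01 plus one made-up row with an impossible date
raw = (
    "04-21-1991\t9:09\t58\t100\n"
    "04-21-1991\t17:08\t62\t119\n"
    "06-31-1991\t7:35\t58\t216\n"   # hypothetical bad row: June has 30 days
)

df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                 names=['date', 'time', 'code', 'bg_value'])

# errors='coerce' turns unparseable datetimes into NaT instead of raising
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'],
                                format='%m-%d-%Y %H:%M', errors='coerce')

n_bad = int(df['datetime'].isna().sum())
clean = df.dropna(subset=['datetime']).drop(columns=['date', 'time'])
print('{} bad row(s) dropped, {} kept'.format(n_bad, len(clean)))
```

I haven't checked whether the two skipped files have other problems besides dates, so this would need testing on the real files before relying on it.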

In [76]:
totalData= totalData.drop(['date','time'], axis=1)
totalData.head()

Unnamed: 0,bg_value,code,datetime,uid
0,100,58.0,1991-04-21 09:09:00,1.0
1,9,33.0,1991-04-21 09:09:00,1.0
2,13,34.0,1991-04-21 09:09:00,1.0
3,119,62.0,1991-04-21 17:08:00,1.0
4,7,33.0,1991-04-21 17:08:00,1.0


In [147]:
#Now let's calculate the dimensions of our dataset and list the 
#unique uids in our table
print(totalData.shape)
len(totalData['uid'].unique())

(27360, 4)


68

In [80]:
#Finally, let's write this table to a csv 
#so that we can just load it later and make it available to others

totalData.to_csv('diabetes_total_data',header=True)

## Processing Data

<a id='Processing Data'></a>

What we can do now is calculate a baseline, essentially a median value across the time traces of all users, so that we can measure each user's deviation from it. To begin with, though, we can just build a heatmap with users as rows and time as columns, where hovering over each element in the grid gives more info about that date (including other activities that took place on that day).
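Here's a minimal sketch of the baseline idea, assuming "baseline" means the median of the per-user medians. The numbers are copied from tables in this notebook (raw readings for uid 1, daily medians for uid 68) and are mixed here purely for illustration; the names `toy`, `user_median`, `baseline`, and `deviation` are made up.

```python
import pandas as pd

# Values copied from tables in this notebook, combined only for illustration
toy = pd.DataFrame({
    'uid': [1] * 5 + [68] * 5,
    'bg_value': [100, 119, 123, 216, 211, 158.0, 135.0, 147.0, 136.0, 111.5],
})

# One median per user, then a single baseline across users
user_median = toy.groupby('uid')['bg_value'].median()
baseline = user_median.median()

# Positive deviation means the user tends to run above the baseline
deviation = user_median - baseline
print(deviation)
```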

In [83]:
#Let's see how many unique datetimes are in the table
len(totalData['datetime'].unique())

13903

In [100]:
bgData['code'].unique()

array([ 58.,  62.,  48.,  60.,  64.,  61.,  63.,  57.,  59.])

In [88]:
bgData= totalData[totalData['code'].isin(glucose_codes)]
bgData.head()

Unnamed: 0,bg_value,code,datetime,uid
0,100,58.0,1991-04-21 09:09:00,1.0
3,119,62.0,1991-04-21 17:08:00,1.0
5,123,48.0,1991-04-21 22:51:00,1.0
6,216,58.0,1991-04-22 07:35:00,1.0
10,211,62.0,1991-04-22 16:56:00,1.0


So I think that keeping the time column instead of date column will be more useful for predictions. Having the date column without the time component can make it easier to visualize the data though. Let's try removing the time first.

In [107]:
bgDataDate= bgData.copy()
bgDataDate['datetime']= bgData['datetime'].dt.date
print(bgDataDate.shape)
bgDataDate.head()

(12610, 4)


Unnamed: 0,bg_value,code,datetime,uid
0,100,58.0,1991-04-21,1.0
3,119,62.0,1991-04-21,1.0
5,123,48.0,1991-04-21,1.0
6,216,58.0,1991-04-22,1.0
10,211,62.0,1991-04-22,1.0


In [116]:
len(bgDataDate['bg_value'].unique())

686

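Ok, so that's 686 distinct glucose values (note that this counts `bg_value`, not dates), and there are still several readings per user per day. I'm now going to group by datetime and uid in order to reduce each user's glucose measurements to one median value per date.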

In [194]:
dataReduced= bgDataDate.copy()
dataReduced['bg_value']= pd.to_numeric(dataReduced['bg_value'], errors='coerce')

In [195]:
dataReduced.dtypes

bg_value    float64
code        float64
datetime     object
uid         float64
dtype: object

In [196]:
dataReduced= dataReduced.groupby(['datetime','uid']).median()
dataReduced.head()

datetime,uid,bg_value,code
1988-03-27,68.0,158.0,60.0
1988-03-28,68.0,135.0,60.0
1988-03-31,68.0,147.0,60.0
1988-04-02,68.0,136.0,60.0
1988-04-04,68.0,111.5,60.0


OK above, I grouped the table by day and then I took the median bg_value for that date. Below, I changed that hierarchical table back into a regular pandas dataframe.

In [198]:
dataReduced= dataReduced.reset_index()
dataReduced.head(3)

Unnamed: 0,datetime,uid,bg_value,code
0,1988-03-27,68.0,158.0,60.0
1,1988-03-28,68.0,135.0,60.0
2,1988-03-31,68.0,147.0,60.0
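One thing worth noting is that the median above is also applied to the `code` column, which is really a category label, so its median isn't meaningful. A possible variation, sketched here on the first five glucose rows for uid 1 (the `toy` and `daily` names are just for this example), is to select `bg_value` before aggregating:

```python
import pandas as pd

# First five glucose rows for uid 1, copied from the bgDataDate table above
toy = pd.DataFrame({
    'datetime': ['1991-04-21', '1991-04-21', '1991-04-21',
                 '1991-04-22', '1991-04-22'],
    'uid': [1, 1, 1, 1, 1],
    'code': [58, 62, 48, 58, 62],
    'bg_value': [100, 119, 123, 216, 211],
})

# Selecting bg_value first keeps the median away from the code column
daily = (toy.groupby(['datetime', 'uid'])['bg_value']
            .median()
            .reset_index())
print(daily)
```

This gives one row per (date, user) with only the glucose median, at the cost of losing the code column entirely.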


Ok now that we have this, in theory we can construct a table that has users as rows and datetime as columns. Not every column will have a value, but we can just keep a null there, then do the imshow command to make the heatmap. The last fancy thing we could add is ranking the users by median bg value?
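For the ranking idea, a rough sketch could look like the following. The values are illustrative daily medians (the uid 68 numbers come from the table above, the uid 1 numbers are only derived from its first five readings), and `toy` and `user_rank` are hypothetical names.

```python
import pandas as pd

# Illustrative daily medians for two users
toy = pd.DataFrame({
    'uid': [1, 1, 68, 68, 68],
    'bg_value': [119.0, 213.5, 158.0, 135.0, 147.0],
})

# Median per user, highest first; the index gives the row order for the grid
user_rank = (toy.groupby('uid')['bg_value']
                .median()
                .sort_values(ascending=False))
print(user_rank)
```

The index of `user_rank` could then be used to order the rows of the heatmap so that users with higher median glucose end up at the top.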

## Constructing Glucose Time Grid

<a id='Constructing Glucose Time Grid'></a>

In [148]:
#First let's determine the dimensions of our current dataframe
print(dataReduced.shape)

(3606, 2)


In [232]:
#Just checking that we can query a table's column (bg_value here) for a given uid
158.0 in dataReduced[dataReduced['uid']== 68]['bg_value'].tolist()

True

In [312]:
#Here I'm just checking that, given TWO CONDITIONS (uid value and code value),
#I can return the corresponding bg_value. This works here, but notice that it won't in our
#inner loop below (see line 13 of the nested loop cell), where we'll have to use ...['bg_value'].item()
dataReduced[(dataReduced['uid']== 68) & (dataReduced['code']== 60.0)]['bg_value'][0]

158.0

Now we'll start constructing the glucose time grid. The first thing we need to do is define the x-axis. Here it will be time and each column will correspond to a unique datetime value (which will have a corresponding index to simplify its construction). 

Thus, we'll 1) sort a unique list of dates, 2) create indices for them, and 3) finally map the dates to the indices in a dictionary. 

In [183]:
dates= np.sort(np.array(bgDataDate['datetime'].unique()))
dateIndex= list(range(len(dates)))
dateDict= dict(zip(dateIndex,dates))

To clarify: We want to visualize the glucose values for all users, but no user had glucose activity on all of those dates (~1100 days). That means that most of our users will have blanks in the majority of the columns.

Below, we'll iterate through the rows (uids, i), then the columns (dates, j) of our new matrix. If the user has a value on that date (column), we'll assign the corresponding median bg value to that cell; otherwise we'll add a nan. Nans are more useful for calculating statistics on the matrix, while zeros would be more useful for plotting the grid, but we can replace the nans right before we imshow the grid.

In [329]:
#Initialize list of lists
glucTimeGrid= []

#Iterate through unique user ids (sorted already) and initialize row vector
for i in bgDataDate['uid'].unique():
    rowVector= []
    #Iterate through dates (via date indices)
    for j in range(0,len(dateIndex)):
        #If dateDict[index], which is a date, is in user's list of dates
        if dateDict[j] in dataReduced[dataReduced['uid']== i]['datetime'].tolist():
            #Append bg_value for given (uid & date) (add print j if wanna troubleshoot)
            rowVector.append(dataReduced[(dataReduced['uid']== i) & 
                                           (dataReduced['datetime']== dateDict[j])]['bg_value'].item())
        else:
            #Else append nan
            rowVector.append(np.nan)
    #Append row to list of lists (glucose matrix)        
    glucTimeGrid.append(rowVector)

In [330]:
#Let's check the dimensions of our final matrix
np.array(glucTimeGrid).shape

(68, 1118)
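The nested loop works, but it rescans the table for every cell. A pivot might get to the same users-by-dates grid more directly. This is only a sketch on a toy table (the uid 68 rows are copied from the output above, the uid 1 row is illustrative), and the names `toy`, `grid`, and `grid_array` are made up.

```python
import numpy as np
import pandas as pd

# Toy version of dataReduced: one median bg_value per (date, uid)
toy = pd.DataFrame({
    'datetime': ['1988-03-27', '1988-03-28', '1991-04-21'],
    'uid': [68, 68, 1],
    'bg_value': [158.0, 135.0, 119.0],
})

# Users become rows, dates become columns, missing combinations become NaN
grid = toy.pivot(index='uid', columns='datetime', values='bg_value')
grid_array = grid.to_numpy()
print(grid)
```

Rows and columns come out sorted by uid and date, and `pivot` will complain if a (date, uid) pair appears more than once, which is a handy check that the daily medians really are unique.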

If you want to double-check the results for individual uids or rows, you can inspect the value of rowVector. You can also check a subset of glucTimeGrid to see if it makes sense (there should be lots of nans and then a short sequence of non-nan numbers).
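Here's a small sketch of that kind of check, and of why nans are handier than zeros for statistics. It uses a tiny made-up grid rather than the real one, and `toy_grid`, `observed`, `row_medians`, and `plot_grid` are hypothetical names.

```python
import numpy as np

# Tiny stand-in for glucTimeGrid: 2 users x 3 dates, nan where there is no reading
toy_grid = np.array([
    [np.nan, 119.0, np.nan],
    [158.0, 135.0, np.nan],
])

# How many dates each user actually has a value for
observed = np.count_nonzero(~np.isnan(toy_grid), axis=1)

# nan-aware median ignores the blanks instead of being dragged toward zero
row_medians = np.nanmedian(toy_grid, axis=1)

# Only swap nans for zeros on a copy meant for plotting
plot_grid = np.where(np.isnan(toy_grid), 0.0, toy_grid)
print(observed, row_medians)
```

If the blanks had been stored as zeros from the start, the per-user medians would be pulled toward zero for any user with few readings.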

In [370]:
%matplotlib nbagg
figure, ax = plt.subplots(1,1, figsize=(10,6))
plt.imshow(np.array(glucTimeGrid), interpolation='nearest',cmap= 'viridis', vmax= 250, vmin= 80)
plt.colorbar(label='BG Values')
ax.set_aspect('auto')
ax.set_facecolor('#333435')
plt.xlabel('Day', fontsize= 18, fontweight= 'bold')
plt.ylabel('User', fontsize= 18, fontweight= 'bold')
plt.title('Daily Blood Glucose Values for 68 Diabetic Subjects', fontsize= 18, fontweight= 'bold')

This is great! Now we have a good way of monitoring everyone's daily glucose levels and picking out the unhealthy ones very easily. We can also see whether there are seasonal correlations (e.g. whether patients whose records overlap in time show similar trends in blood glucose values). In the final step of our analysis we'll try to determine what other trends the subjects have in common.