<a href="https://colab.research.google.com/github/stsan9/EndoMondoResearchERSP/blob/master/EndoRNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# Import the necessary libraries
%tensorflow_version 2.x
import tensorflow as tf
import numpy as np
import pandas as pd
import math
import os
from sklearn.preprocessing import MinMaxScaler
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.callbacks import ModelCheckpoint
from tensorflow.python.keras.layers import Input, Dense, LSTM, Embedding, Dropout
from tensorflow.python.keras.optimizers import RMSprop
from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from tensorflow.python.keras.models import load_model

TensorFlow 2.x selected.


In [0]:
# Mount the google drive file system
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [0]:
# Load in the data file and store it in a list; data in shared drive
properPath = '/content/gdrive/My Drive/EndoMondoData/endomondoHR_proper.json' # this may be personalized
data = []

with open(properPath) as f:
    for l in f:
        data.append(eval(l))

In [0]:
# convert to pandas dataframe and drop the unsused columns
dataframe = pd.DataFrame.from_dict(data)
dfsave = dataframe
dataframe = dataframe.drop(columns = ["longitude", "altitude", "latitude", "speed", "url", "id", "gender"])

In [0]:
# function to extract first element of each list "l"
def begin(l):
    if isinstance(l, list):
        return l[0]

# function to get the mean of only the middle 300 / 500 timestamps in one workout
def mean(l):
    return np.mean(l[100:400])

In [0]:
# get average heart rate and starting timestamp of all workouts
dataframe['heart_rate'] = dataframe['heart_rate'].apply(mean)
dataframe['timestamp'] = dataframe['timestamp'].apply(begin)

In [0]:
# filtering out suspicious users based on heart rate
bad_users = dataframe[dataframe['heart_rate'] > 185] 
bad_users = dataframe[dataframe['heart_rate'] < 40]
dataframe = dataframe[~dataframe.userId.isin(bad_users['userId'].unique())]

In [0]:
# one hot encode the sports column
one_hot_sport = pd.get_dummies(dataframe.sport)
one_hot_sport
dataframe = pd.concat([dataframe, one_hot_sport], axis = 1)
dataframe = dataframe.drop(columns = "sport")
#dataframe

In [0]:
num_columns = len(dataframe.columns) - 2 # columns - 2 refers to including all the columns except userId and heart_rate
num_columns

44

In [0]:
# number of unique users
len(dataframe["userId"].unique())

1039

In [0]:
# dataframe now only has users who have more than 50 workouts
dataframe = dataframe.groupby("userId").filter(lambda x : len(x) > 50)
len(dataframe["userId"].unique()) # number of unique users after filtering for those with over 40 workouts

698

In [0]:
# Create an object from the Normalizer class
min_scaler = MinMaxScaler()  

In [0]:
"""
@params:
batch_size: how many testing units we want (how many random users do we want to use in training per function call/steps per epoch)
sequence_length: how many workouts per user do we want (e.g. first n number of workouts)

@purpose:
This function is used to generate the training input for the model. Keras models take in
an x and y param, where x is the input, and y is the target output corresponding to the
input.

@algorithm:
Create 2 empty batches. "x_batch" represents what the model will directly take in as input.
"y_batch" represents the true values we want to predict. x and y batches holds data for 
"batch_size" number of users.

To fill all indices in the x and y batch, we perform the for loop below and select
a user one by one from the "userids" list. We extract all of user_x's rows from the dataframe
and sort those rows by their timestamp so the input will be properly sequenced for the model.

We then drop the columns for userId and timestamp for user_x as those are no longer needed 
and don't have any correlation to the heart_rate prediction.

To turn user_x into what will be input the model can accept, we convert it to a numpy array (keras models 
don't take pandas dataframes) and then normalize it's values to be between 0 and 1 (neural networks don't 
work well with large scalars). This final result will be stored in x_scaled, which is then put in x_batch[i].

To turn user_data into what will be the input's corresponding target data, we extract only
the heart_rate, and convert that into a numpy array. This get's stored in y_out and then
y_batch[i].

@returns:
x_batch: input signals to RNN
y_batch: the corresponding target data (heart_rates) to the input signals
"""
def batch_gen(batch_size, sequence_length):
  userids = dataframe['userId'].unique() # contains all the userids that are good enough to evaluate

  while True:
  # Allocate a new array for the batch of input_signals.
    x_shape = (batch_size, sequence_length, 44) # shape of the input
    x_batch = np.zeros(shape=x_shape, dtype=np.float16) # represents the user extracted data used as inputs

    # Allocate a new array for the batch of output-signals.
    y_shape = (batch_size, sequence_length) #shape of the output
    y_batch = np.zeros(shape=y_shape, dtype=np.float16) # represents the user extracted data used as true values

    for i in range(batch_size):
      user_x = dataframe.loc[dataframe["userId"] == userids[i]].sort_values("timestamp") # get a userId and sort the user by timestamp
      user_x = user_x.drop(columns=["userId", "timestamp"]) # drop the user's userId and timestamp
      y_out = user_x.heart_rate 

      x_input = user_x.values[0:sequence_length] #inputs used will be from the range (0 - sequence_length)
      y_out = y_out.values[10:sequence_length + 10] # trues values (predictions) will go from the range (sequence_length - to the end)
        
      x_scaled = min_scaler.fit_transform(x_input) # Scale the x_input data from 0 to 1
        
      x_batch[i] = x_scaled # 
      y_batch[i] = y_out  
    yield (x_batch, y_batch) # returns input signals (x_batch) and corresponding target data(heart_rates) to the input signals (y_batch)

generator = batch_gen(40, 40) # (Batch size: 40), (Sequence length: 40)

In [0]:
"""
Callbacks

- Various callbacks allow monitoring of a model to prevent overfitting.
- The best version of the model (with specific weights) can be saved
- Model will be stopped when loss is not decreasing/is minimized
- Using "EarlyStopping" and "ModelCheckpoint"
  - LR Scheduler?

Resources
- Early Stopping
  - https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
  - Evaluate + Visualize model based on ^^ link
- Model Checkpoint in Google Colab
  - https://medium.com/@mukesh.kumar43585/model-checkpoint-google-colab-and-drive-as-persistent-storage-for-long-training-runs-e35ffa0c33d9
"""

#Stops the model to minimize loss when there is no more improvement
es=EarlyStopping( monitor='val_loss', #quantity to be monitored, validation dataset loss
                  mode='min', #minimizing loss
                  verbose=1, #print out epoch where training was stopped
                  patience=3 #number of epochs with no improvement where training will be stopped
                )

#Save the best model in training for later use
filepath="/content/gdrive/My Drive/EndoMondoData/epochs:{epoch:03d}-val_loss:{val_loss:.3f}.hdf5"
mc= ModelCheckpoint( filepath, #path to where model should be saved
                     monitor='val_loss', #quanity to be monitored
                     mode='min', #minimizing loss
                     verbose='1', #print out epoch saved
                     save_best_only='True' #saves the best model only, won't overwrite per run
                   )

#List of callbacks for model
callbacks_list=[es, mc]

In [0]:
#Load a saved model
#saved_model = load_model('best_model.h5')

In [0]:
# baseline for the rnn, the mse error if the predicted values is the mean of previous workouts
error_sq = [] # list of error values per user

# adds the squared difference of the average per user
for user in dataframe['userId'].unique():
  user_x = dataframe.loc[dataframe["userId"] == user].sort_values("timestamp", ascending=False)
  avg_hr = np.average(user_x.iloc[1: 40]['heart_rate'])
  error_sq += [(user_x.iloc[0]['heart_rate'] - avg_hr) ** 2]


dummy_mse = np.average(error_sq) # the final MSE value
print('Baseline MSE: ' + str(dummy_mse))

Baseline MSE: 190.8077249662216


In [0]:
# build the model (1 LSTM layer and 1 output layer)
model = Sequential()

model.add(LSTM(units = 256, return_sequences=True, input_shape = (None, num_columns,)))  # "44" is the number of columns an input has
model.add(Dropout(0.1))
model.add(Dense(256, activation = 'tanh'))
model.add(Dense(1, activation = 'linear'))  # "linear" activation function f(x) = x

optimizer = tf.keras.optimizers.RMSprop(lr=1e-3) #low learning rate, could change this as well

model.compile(loss='mean_squared_error', optimizer=optimizer)  # using mse loss function

In [0]:
model.fit_generator(generator=generator,  # the batch generator
          epochs=40,            # number of training cycles
          steps_per_epoch=30)#,   # number of calls to generator per cycle
          #callbacks=callbacks_list) #list of callbacks to apply - BROKEN

Instructions for updating:
Please use Model.fit, which supports generators.
  ...
    to  
  ['...']
Train for 30 steps
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7fdaf3721908>

Still need to:
- Add callbacks (save the model after training) - SRAVYA 
- Extract a validation set from the current training set - SRAVYA
- Extract a training set and testing set from the current set - ANDRES
- Evaluate the model and experiment with adding back in other contextual variables
- Modify the data and RNN to output a timestamp as well
- Visualize our RNN's predictions