# Train your model on your own data with sdtm-mapper

March 15, 2019
## 1. About

  This is a demo of training a model on your own data with the pre-built models in the Python package `sdtm-mapper`. 
  Hyperparameter tuning will be covered in a separate tutorial.


## 2. Steps

  * You may want to select a GPU instance. In Google Colab, go to 'Runtime' $\rightarrow$ 'Change runtime type' $\rightarrow$ select `GPU`.
  * Load your training, validation, and test datasets from CSV files.
  * Select which pre-built model you want to use, add your own model architecture, or adjust the hyperparameter settings in the `SDTMModels` class.
  * If you work in Colab, you will need to install `sas7bdat`, `pathlib`, and `tensorflow-hub`.





## 3. Installation

You will need to have `sas7bdat`, `tensorflow-hub`, and `pathlib` installed on Google Colab.


In [0]:
!pip install sas7bdat tensorflow-hub pathlib



In [0]:
!pip install sdtm-mapper



## 4. Import the necessary packages

In [0]:
import pandas as pd
import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
from keras.callbacks import ModelCheckpoint

# Here you import sdtm_mapper
import sdtm_mapper.SDTMModels as sdtm
import sdtm_mapper.SDTMMapper as mapper
from sdtm_mapper import samples

# Data bucket and key prefix passed to SDTMMapper below
bucket = 'snvn-sagemaker-1'
KEY = 'mldata/Sam/data/project/xxx-000/xxx/xxx-201/csr/data/raw/latest/'

# Start from a clean graph and register a TensorFlow session with Keras
tf.keras.backend.clear_session()
sess = tf.Session()
K.set_session(sess)

W0314 00:34:39.888021 140512926726016 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14
Using TensorFlow backend.


## 5. Train the model

  First, upload your **training data, validation data** and **test data**. 

  * `X_train` and `y_train` will be used for training.

  * `X_valid` and `y_valid` will be used for validation.

  * I used 18 studies for training and 3 studies for validation.


  Let's look at the data, using the validation file as an example.

In [0]:
pd.read_csv('ae_validation.csv').head()

Unnamed: 0,ID,text,sdtm
0,S361203.AE.AESMIE,AESMIE OTHER MEDICALLY IMPORTANT SERIOUS EVENT,AESMIE
1,S361203.AE.AEOUT,AEOUT OUTCOME OF ADVERSE EVENT,AEOUT
2,S361203.AE.SUBJECTID,SUBJECTID INTERNAL ID FOR THE SUBJECT,DROP
3,S361203.AE.PROJECTID,PROJECTID PROJECTID,DROP
4,S361203.AE.AEENDAT_INT,AEENDAT INT END DATE OF ADVERSE EVENT INTERPOL...,DROP
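
In this sample the `ID` values appear to follow a `study.domain.variable` pattern. The following is a minimal sketch of splitting such an ID into its parts, assuming that pattern holds for your own files. `VariableId` and `split_id` are hypothetical helpers written for this illustration and are not part of `sdtm-mapper`.

```python
from typing import NamedTuple


class VariableId(NamedTuple):
    study: str
    domain: str
    variable: str


def split_id(raw_id: str) -> VariableId:
    """Split an ID such as 'S361203.AE.AESMIE' into its three parts.

    Assumes exactly two dots; raises ValueError otherwise.
    """
    parts = raw_id.split(".")
    if len(parts) != 3:
        raise ValueError(f"unexpected ID format: {raw_id!r}")
    return VariableId(*parts)


# IDs copied from the sample output above
sample_ids = ["S361203.AE.AESMIE", "S361203.AE.AEOUT", "S361203.AE.AEENDAT_INT"]
parsed = [split_id(i) for i in sample_ids]
print(parsed[0])
```

IDs that do not have three dot-separated parts raise an error here, so you would need to relax the check if your files use a different layout.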


Here we load the SDTMMapper object so we can start using some helper functions.

In [0]:
sdtmmap=mapper.SDTMMapper('ae', bucket, KEY)

Next, load the training, validation, and test data, which contain your $X$, and perform basic pre-processing. 

`drop_sys_vars` preprocesses the metadata. Judging by how its results are used in this notebook, it returns three values: the dropped variables, the input $X$, and a data frame of the remaining rows.

You can exclude variables from training by passing a `suffix` regular expression, as in this example. Here I am excluding variables whose names end with the suffixes below. These may be specific to Medidata Rave EDC core configurations.

```
suffix='.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}|.*_CV$'
```
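
To see which variable names such a pattern catches, you can try it on a few names with Python's `re` module. This is only a sketch: it assumes the pattern is applied with `re.match`, which may differ from what `drop_sys_vars` does internally, and apart from `AEENDAT_INT` the variable names are made up for illustration.

```python
import re

# Pattern copied from the tutorial
suffix = '.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}|.*_CV$'
pattern = re.compile(suffix)

# AEENDAT_INT appears in the sample data; the other names are hypothetical.
names = ["AEENDAT_INT", "AESTDAT_RAW", "AESTDAT_YYYY", "AETERM", "AE_YN"]

# Assumption: the pattern is applied with re.match (anchored at the start).
excluded = [n for n in names if pattern.match(n)]
kept = [n for n in names if not pattern.match(n)]
print("excluded:", excluded)
print("kept:    ", kept)
```

Note that `.*_Y{1,4}` is the only alternative without a trailing `$`, so under `re.match` it also catches names such as `AE_YN` that merely contain `_Y`. Add the `$` if that is not what you intend.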


In [0]:
suffix='.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}|.*_CV$'
trdropped, X_train, trdf= sdtmmap.drop_sys_vars('ae_training.csv', 'rave', suffix)
vdropped,  X_valid, vdf = sdtmmap.drop_sys_vars('ae_validation.csv', 'rave', suffix)
tdropped,  X_test,  tdf = sdtmmap.drop_sys_vars('ae_test_study.csv', 'rave', suffix)

Encode $Y$, which contains the SDTM variables, with `encode_sdtm_target`. The second argument is the name of the encoding pickle file, which is saved in the decode folder. You need to use the matching pickle file to decode $\hat{y}$.

The encoding can be reversed later with `sdtmmap.decode_sdtm_target(Y, 'train_encode')`.
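
To see why the decoding file has to match the one used for encoding, here is a minimal sketch of one-hot encoding and decoding in plain Python. It is an illustration only: it assumes a one-hot layout with classes in sorted order and does not show how `encode_sdtm_target` is actually implemented. The names `fit_classes`, `one_hot`, and `decode` are hypothetical.

```python
from typing import Dict, List


def fit_classes(labels: List[str]) -> List[str]:
    """Return the sorted list of distinct labels; list index = class id."""
    return sorted(set(labels))


def one_hot(labels: List[str], classes: List[str]) -> List[List[int]]:
    """Encode each label as a 0/1 row with a single 1 at its class index."""
    index: Dict[str, int] = {c: i for i, c in enumerate(classes)}
    return [[1 if index[lab] == j else 0 for j in range(len(classes))]
            for lab in labels]


def decode(rows: List[List[float]], classes: List[str]) -> List[str]:
    """Map each row of scores back to the label with the highest score."""
    return [classes[max(range(len(r)), key=r.__getitem__)] for r in rows]


train_labels = ["AEOUT", "DROP", "AESMIE", "DROP"]
classes = fit_classes(train_labels)
y = one_hot(train_labels, classes)

# Pretend model output for two samples (one score per class)
y_hat = [[0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
print(decode(y_hat, classes))
```

Decoding with a class list built from a different label set would shift the column indices and return the wrong variable names.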

In [0]:
y_train = sdtmmap.encode_sdtm_target(trdf['sdtm'].str.upper(), 'train_encode')
y_valid = sdtmmap.encode_sdtm_target(vdf['sdtm'].str.upper(), 'valid_encode')
y_test  = sdtmmap.encode_sdtm_target(tdf['sdtm'].str.upper(), 'test_encode')

In [0]:
trdf['sdtm'].shape, y_train.shape

((584,), (584, 34))

You may want to check that the training and validation sets contain the same classes, and how many classes there are.

In [0]:
print("extra in training:   ", list(set(trdf['sdtm']) - set(vdf['sdtm'])))
print("extra in validation: ", list(set(vdf['sdtm']) - set(trdf['sdtm'])))

extra in training:    []
extra in validation:  []


In [0]:
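# `shape` holds the training class count and is passed to SDTMModels later
shape = len(set(trdf['sdtm']))
print("Number of classes in training: ", shape)
n_valid_classes = len(set(vdf['sdtm']))
print("Number of classes in validation: ", n_valid_classes)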

Number of classes in training:  34
Number of classes in validation:  34


In [0]:
X_train.shape, X_valid.shape

((584,), (100,))


Let's create an `SDTMModels` object and fit model 1. 

**Steps:**
1. Create an `SDTMModels` object. You need to provide the domain name and the number of classes.
2. Build a model. Here I am calling pre-built model 1. As a reminder, `sdtm-mapper` comes with three pre-built models. The second argument of `build_model` is a boolean that controls the model summary.

**Optional steps:**

I will cover this part in another tutorial:
1. You can adjust the hyperparameters of each pre-built model, or add additional models to the `SDTMModels` class and call them through the first argument of `build_model`. 

2. You could also create your own model here without using `build_model`, and once you are satisfied with it, add it to the `SDTMModels` class. Please consider sharing your model architecture with others.

3. Depending on your model, you may need to transform $X$ further. For example, if you are building an RNN or LSTM, you need to tokenize the text and pad or truncate each sample to the same length. 
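
The tokenize-and-pad step mentioned in the last item can be sketched in plain Python as follows. This is a simplified illustration using whitespace tokens. The helpers `build_vocab` and `encode` are hypothetical and not part of `sdtm-mapper`, and a real pipeline would more likely use the tokenizer utilities of your deep learning framework.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # reserved ids for padding and unknown tokens


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct lower-cased token."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab) + 2)  # ids start after PAD/UNK
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert text to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(tok, UNK) for tok in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Two `text` values taken from the sample data shown earlier
texts = ["AEOUT OUTCOME OF ADVERSE EVENT", "PROJECTID PROJECTID"]
vocab = build_vocab(texts)
X = [encode(t, vocab, max_len=6) for t in texts]
print(X)
```

Every row of `X` now has the same length, which is what a recurrent layer expects as input.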



In [0]:
models=sdtm.SDTMModels('ae', shape)
model1 = models.build_model(1, False)

filepath='chkpt_Elmo+sfnn+ae+Model1.hdf5'

checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

history = model1.fit(X_train, y=y_train, batch_size=32, verbose=1, validation_data=(X_valid,y_valid), 
          shuffle=True, epochs=5, callbacks=[checkpointer])

W0314 00:34:52.803727 140512926726016 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


I0314 00:34:53.638039 140512926726016 saver.py:1483] Saver not created because there are no variables in the graph to restore


W0314 00:34:53.850312 140512926726016 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.


Train on 584 samples, validate on 100 samples
Epoch 1/5

Epoch 00001: val_acc improved from -inf to 0.56000, saving model to chkpt_Elmo+sfnn+ae+Model1.hdf5
Epoch 2/5

Epoch 00002: val_acc improved from 0.56000 to 0.75000, saving model to chkpt_Elmo+sfnn+ae+Model1.hdf5
Epoch 3/5

Epoch 00003: val_acc improved from 0.75000 to 0.87000, saving model to chkpt_Elmo+sfnn+ae+Model1.hdf5
Epoch 4/5

Epoch 00004: val_acc improved from 0.87000 to 0.96000, saving model to chkpt_Elmo+sfnn+ae+Model1.hdf5
Epoch 5/5

Epoch 00005: val_acc improved from 0.96000 to 0.97000, saving model to chkpt_Elmo+sfnn+ae+Model1.hdf5


Comparing the training and validation accuracy, the validation set scores higher than the training set, and the validation accuracy was still improving at epoch 5. This suggests the model has not finished learning, so you may want to increase the number of epochs, and I would also reduce the batch size. You could also adjust the hyperparameters.
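
A quick way to judge whether more epochs are worthwhile is to look at how much the validation accuracy gained in each epoch. The sketch below uses the values from the log above. In a live session they would come from the `history` object, under a key that is assumed here to be `val_acc` to match the `monitor` setting (newer Keras versions name it `val_accuracy`).

```python
# Validation accuracy per epoch, copied from the training log above.
# Assumption: in a live session this is history.history['val_acc'].
val_acc = [0.56, 0.75, 0.87, 0.96, 0.97]

# 1-based index of the epoch with the highest validation accuracy
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
# Improvement from each epoch to the next
gains = [round(b - a, 2) for a, b in zip(val_acc, val_acc[1:])]

print("best epoch:", best_epoch)
print("gain per epoch:", gains)
if best_epoch == len(val_acc):
    print("Best score is at the last epoch; more epochs may still help.")
```

The gains shrink from epoch to epoch, so additional epochs are likely to bring smaller improvements than the first few did.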



In [0]:
model1.save('Elmo+sfnn+ae+Model1.h5')

## 6. Test the model on the test set

In [0]:
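def macro_acc_test_data(model, X_=X_test, testdf=tdf, droppeddf=tdropped):
    """Predict on the test set; return (results, fraction of rows correct)."""
    # Use the model passed in rather than the global model1
    predictions = model.predict(X_)
    # Decode with the same encoding file used for the training targets
    testdf['pred'] = sdtmmap.decode_sdtm_target(predictions, 'train_encode')
    # Combine with the variables dropped during pre-processing
    df = sdtmmap.add_drop(testdf, droppeddf)
    return df, sum(df['sdtm'] == df['pred']) / len(df)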

df, acc=macro_acc_test_data(model1)
print("macro acc: ",acc)

macro acc:  0.9425287356321839
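
Despite the label, the value printed here is the fraction of all rows predicted correctly, that is, the overall accuracy. A macro-averaged score would instead average the per-class accuracy so that rare classes count as much as frequent ones. The sketch below contrasts the two on toy labels that are not taken from the tutorial's data. Both function names are hypothetical.

```python
from collections import defaultdict
from typing import List


def overall_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of all rows predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-class accuracy (recall), so every class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels for illustration only
y_true = ["DROP", "DROP", "DROP", "DROP", "AELLT", "AEHLT"]
y_pred = ["DROP", "DROP", "DROP", "DROP", "AEHLGT", "AEHLT"]
print(overall_accuracy(y_true, y_pred))
print(macro_accuracy(y_true, y_pred))
```

When one class such as `DROP` dominates the data, the overall accuracy can look much better than the macro average, so it is worth checking both.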


So, without any adjustment, we got 94.25% accuracy. Let's check where the model made mistakes.

In [0]:

df[df['sdtm']!=df['pred']]

Unnamed: 0,ID,text,sdtm,pred
4,LLT_NAME,LLT_NAME LLT_NAME,AELLT,AEHLGT
5,LLT_CODE,LLT_CODE LLT_CODE,AELLTCD,DROP
8,HLT_NAME,HLT_NAME HLT_NAME,AEHLT,DROP
17,AEENTIM,AEENTIM Stop Time,AEENDTC_TM,DROP
32,AESTTIMU,AESTTIMU Start Time Unknown,DROP,AESTDTC_TM


The preparation of the training data and pre-processing are not so fun, but trying various techniques to improve the score is always fun!

By adding more training data, adjusting the hyperparameters, or trying a different model architecture, the score should improve. You may also want to verify that your training data is clean and accurate: reviewing where the model made mistakes can reveal labeling errors of your own.
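
One simple way to review the mistakes is to tally them by predicted label. The sketch below does this for the five misclassified rows listed above, using only the Python standard library. With a larger test set you would build the pairs from the `sdtm` and `pred` columns instead of typing them in.

```python
from collections import Counter

# (true label, predicted label) pairs copied from the error table above
errors = [
    ("AELLT", "AEHLGT"),
    ("AELLTCD", "DROP"),
    ("AEHLT", "DROP"),
    ("AEENDTC_TM", "DROP"),
    ("DROP", "AESTDTC_TM"),
]

# Count how often each label was predicted in error
by_prediction = Counter(pred for _, pred in errors)
print(by_prediction.most_common())
```

Here three of the five errors are variables that were wrongly predicted as `DROP`, which is a useful place to start looking.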

If you are satisfied with what you got from the model, you can export the results. You can also concatenate the exported file to the training set to improve your model further.

In [0]:

df.to_csv('results.csv')