# Tutorial on how to use SDTMMapper to generate mapping specifications

This tutorial will cover how to generate mapping specifications for your datasets.

In [1]:
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report
from keras import backend as K
import keras.layers as layers
from keras.models import Model, Sequential, load_model
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, Flatten
from keras.utils import to_categorical, np_utils
from keras.engine import Layer
import numpy as np

from SDTMMapper import SDTMMapper, SDTMModels
bucket='snvn-sagemaker-1' #data bucket
KEY='mldata/Sam/data/project/380-000/bipolar/380-201/csr/data/raw/latest/'
# Initialize session
sess = tf.Session()
K.set_session(sess)

Using TensorFlow backend.


## 1. Load SDTMMapper 

1. Load the `SDTMMapper` utilities.

2. Load the SAS datasets from S3 to try a model on new data. If your datasets are stored locally, set `isS3` to `False` and pass `localpath` instead.

`SDTMMapper(domain, isS3, bucket='', KEY='', localpath='',**kwargs)`

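- `domain` - name of the dataset, e.g. `ae`
- `isS3` - whether the datasets are on S3
- `bucket` - S3 bucket name (used only when `isS3` is `True`)
- `KEY` - S3 key of the folder containing the datasets (used only when `isS3` is `True`)
- `localpath` - path to the local directory containing the datasets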
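
The two ways of pointing the mapper at data can be sketched as a small helper. This is an illustration only: `make_mapper` and `FakeMapper` are hypothetical names that are not part of SDTMMapper, and the helper relies solely on the signature shown above (`domain`, `isS3`, `bucket`, `KEY`, `localpath`).

```python
def make_mapper(mapper_cls, domain, bucket="", key="", localpath=""):
    """Build a mapper for S3 when a bucket is given, otherwise for a local folder.

    mapper_cls is assumed to follow the signature documented above:
    mapper_cls(domain, isS3, bucket='', KEY='', localpath='', **kwargs)
    """
    if bucket:
        # S3 case: isS3=True, datasets located by bucket and KEY
        return mapper_cls(domain, True, bucket=bucket, KEY=key)
    if not localpath:
        raise ValueError("Provide either an S3 bucket or a local path")
    # Local case: isS3=False, datasets read from a directory
    return mapper_cls(domain, False, localpath=localpath)


class FakeMapper:
    """Stand-in used only so this sketch runs without SDTMMapper installed."""

    def __init__(self, domain, isS3, bucket="", KEY="", localpath="", **kwargs):
        self.domain, self.isS3 = domain, isS3
        self.bucket, self.KEY, self.localpath = bucket, KEY, localpath


local = make_mapper(FakeMapper, "ae", localpath="data/raw")
remote = make_mapper(FakeMapper, "ae", bucket="my-bucket", key="path/to/raw/")
print(local.isS3, local.localpath)
print(remote.isS3, remote.bucket)
```

With the real package, `SDTMMapper.SDTMMapper` would take the place of `FakeMapper`.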

In [2]:
sdtmmap=SDTMMapper.SDTMMapper('ae', True, bucket, KEY)

## 2. Load SDTMModels

1. Specify the domain name and the model you want to use.
2. For this tutorial, I will use `model 3`: `Elmo+fnn+ae+Model3.h5`. Set the size of the softmax layer to `34`.

In [3]:
models = SDTMModels.SDTMModels('ae', 34)
model = models.build_model(3, False)
model.load_weights('../model/Elmo+fnn+ae+Model3.h5')

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


## 3. Generate Mapping Spec Template in CSV file

In this step, SAS variable names and variable labels are extracted and stored in a CSV file.

The CSV file is stored in the `test_data` folder. If the folder does not exist, it will be created for you.

You need to specify the encoding of your SAS dataset and the file name of the output CSV file.


In [4]:
ae=sdtmmap.sas_metadata_to_csv('latin','test_study_ae.csv')
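
As a rough illustration of what this step produces, the sketch below writes variable names and labels to a CSV file inside `test_data`, creating the folder when it is missing. It is not the library's implementation: `write_metadata_csv` is a hypothetical helper, the `ID` and `text` columns are inferred from the `Dt.head()` output shown later in this tutorial, and the sample variables are made up.

```python
import csv
import os
import tempfile


def write_metadata_csv(variables, out_dir, filename):
    """Write (name, label) pairs to out_dir/filename and return the path."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(out_dir, filename)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ID", "text"])
        for name, label in variables:
            # text = variable name followed by its label
            writer.writerow([name, f"{name} {label}"])
    return path


# Illustrative metadata, not taken from a real study
sample = [("AETERM", "Adverse event term"), ("AESTDAT", "Start date")]
base = tempfile.mkdtemp()
out_dir = os.path.join(base, "test_data")
csv_path = write_metadata_csv(sample, out_dir, "test_study_ae.csv")

with open(csv_path, newline="", encoding="utf-8") as fh:
    rows = list(csv.reader(fh))
print(rows)
```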

You can hard-code which raw variables should be dropped with the regular expression in `suffix`.

You need to specify which EDC system was used for your raw SAS dataset. Here I am specifying `'rave'`, which is currently the only EDC system supported.

`drop_sys_vars` generates three outputs:

1. A pandas DataFrame containing the dropped variables.
2. A pandas Series of variable metadata excluding the dropped variables.
3. A pandas DataFrame of variable metadata excluding the dropped variables.

All letters are also converted to lower case.

In [5]:
# Variables whose names end with these suffixes are dropped
suffix='.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}$'

Dt, Xt, df=sdtmmap.drop_sys_vars(os.path.join('test_data','test_study_ae.csv'), 'rave',suffix)
# Reshape to a column array of shape (n_samples, 1)
X=np.array(Xt, dtype=object)[:, np.newaxis]
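
To check which variable names a pattern like the one above would flag, it can be tried against a few names with Python's `re` module. This sketch assumes the pattern is applied to the whole upper-case variable name with `re.match`; how `drop_sys_vars` applies it internally is not shown in this tutorial. The sample names are illustrative.

```python
import re

suffix = r'.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}$'
pattern = re.compile(suffix)

# Illustrative variable names, not taken from a real study
names = ["AESTDAT", "AESTDAT_RAW", "AESTDAT_INT", "AESTDAT_YYYY",
         "AESTDAT_MM", "AESTDAT_DD", "AETERM", "AETERM_STD"]

dropped = [n for n in names if pattern.match(n)]
kept = [n for n in names if not pattern.match(n)]
print("dropped:", dropped)
print("kept:", kept)
```

Note that `D{1,2}` matches one or two literal `D` characters (for example `_D` or `_DD`), not digits; the same holds for `M{1,2}` and `Y{1,4}`.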

In [6]:
Dt.head()

Unnamed: 0,ID,text,sdtm,pred
0,PROJECTID,PROJECTID projectid,DROP,DROP
1,PROJECT,PROJECT project,DROP,DROP
2,STUDYID,STUDYID Internal id for the study,DROP,DROP
3,ENVIRONMENTNAME,ENVIRONMENTNAME Environment,DROP,DROP
4,SUBJECTID,SUBJECTID Internal id for the subject,DROP,DROP


## 4. Run the model

In [7]:
output = model.predict(X)


## 5. Generate Output

In [8]:
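# Decode the model output into SDTM target names
df['sdtm']=sdtmmap.decode_sdtm_target(output)
# Add the dropped variables (first three columns of Dt) to the spec
spec=sdtmmap.add_drop(df,Dt.iloc[:,:3])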
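
Conceptually, decoding a softmax output means taking the highest-scoring class for each row and looking up its label. The sketch below shows that idea in plain Python and then appends the dropped variables. It is an assumption about what `decode_sdtm_target` and `add_drop` do, not their actual code; the label list, scores, and helper names are hypothetical.

```python
def decode_predictions(scores, labels):
    """Return the label with the highest score for each row of scores."""
    decoded = []
    for row in scores:
        best = max(range(len(row)), key=lambda i: row[i])  # argmax
        decoded.append(labels[best])
    return decoded


def combine_spec(kept_ids, predicted, dropped_ids):
    """Merge predicted mappings with variables marked DROP."""
    spec = [{"ID": i, "sdtm": p} for i, p in zip(kept_ids, predicted)]
    spec += [{"ID": i, "sdtm": "DROP"} for i in dropped_ids]
    return spec


labels = ["AETERM", "AESTDTC", "AESEV"]      # hypothetical 3-class target
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]  # one row of probabilities per variable
predicted = decode_predictions(scores, labels)
spec = combine_spec(["aestdat", "aeterm"], predicted, ["PROJECTID"])
print(spec)
```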