# 🛠 Exercises

## 0. Prerequisites

In [1]:
# import libraries
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub

In [2]:
# get helper functions
!wget https://raw.githubusercontent.com/yhs2773/TensorFlow-for-Deep-Learning/main/helper_functions.py
from helper_functions import calculate_results

--2024-01-03 12:26:14--  https://raw.githubusercontent.com/yhs2773/TensorFlow-for-Deep-Learning/main/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11093 (11K) [text/plain]
Saving to: ‘helper_functions.py’


2024-01-03 12:26:14 (62.0 MB/s) - ‘helper_functions.py’ saved [11093/11093]



In [3]:
# get data
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git
!ls pubmed-rct

Cloning into 'pubmed-rct'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 39 (delta 8), reused 5 (delta 5), pack-reused 25
Receiving objects: 100% (39/39), 177.08 MiB | 15.81 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Updating files: 100% (13/13), done.
PubMed_200k_RCT				       PubMed_20k_RCT_numbers_replaced_with_at_sign
PubMed_200k_RCT_numbers_replaced_with_at_sign  README.md
PubMed_20k_RCT


In [4]:
# set directory for the 20k dataset
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [5]:
# list the data files in the target directory
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

['pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',
 'pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt',
 'pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt']

In [6]:
# function to read lines of a document
def get_lines(filename):
    with open(filename, "r") as f:
        return f.readlines()

In [7]:
train_lines = get_lines(data_dir + "train.txt")
train_lines[:20]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

In [8]:
# function to preprocess data
def preprocess_text_with_line_numbers(filename):
    input_lines = get_lines(filename)
    abstract_lines = ""
    abstract_samples = []

    for line in input_lines:
        if line.startswith("###"):
            abstract_id = line
            abstract_lines = ""
        elif line.isspace():
            abstract_line_split = abstract_lines.splitlines()

            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                line_data = {}
                target_text_split = abstract_line.split("\t")
                line_data['target'] = target_text_split[0]
                line_data['text'] = target_text_split[1].lower()
                line_data['line_number'] = abstract_line_number
                line_data['total_lines'] = len(abstract_line_split) - 1
                abstract_samples.append(line_data)
        else:
            abstract_lines += line

    return abstract_samples

In [9]:
# preprocess data
train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt")
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")

In [10]:
# turn into data frames
train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
test_df = pd.DataFrame(test_samples)

In [11]:
# get lists of sentences
train_sentences = train_df['text'].tolist()
val_sentences = val_df['text'].tolist()
test_sentences = test_df['text'].tolist()

In [12]:
# one-hot encode labels
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
train_oh = ohe.fit_transform(train_df['target'].to_numpy().reshape(-1, 1))
val_oh = ohe.transform(val_df['target'].to_numpy().reshape(-1, 1))
test_oh = ohe.transform(test_df['target'].to_numpy().reshape(-1, 1))

In [13]:
# label encode labels (instrumental in getting class names)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_le = le.fit_transform(train_df['target'].to_numpy())
val_le = le.transform(val_df['target'].to_numpy())
test_le = le.transform(test_df['target'].to_numpy())

In [14]:
# get num_classes and class_names
num_classes = len(le.classes_)
class_names = le.classes_
num_classes, class_names

(5,
 array(['BACKGROUND', 'CONCLUSIONS', 'METHODS', 'OBJECTIVE', 'RESULTS'],
       dtype=object))

In [15]:
# download pre-trained USE
tf_hub_embedding_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        trainable=False,
                                        name='universal_sentence_encoder')

In [16]:
# function to split sentences into characters
def split_chars(text):
    return " ".join(list(text))

In [17]:
# split sentence into characters
train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]

# check the distribution of character length
char_lens = [len(sentence) for sentence in train_sentences]
output_seq_char_len = int(np.percentile(char_lens, 95))
output_seq_char_len

290

In [18]:
import string

alphabet = string.ascii_lowercase
alphabet

'abcdefghijklmnopqrstuvwxyz'

In [19]:
# create char-level token vectorizer
char_vectorizer = tf.keras.layers.TextVectorization(max_tokens=len(alphabet) + 2,
                                                    output_sequence_length=output_seq_char_len,
                                                    name='char_vectorizer')

# adapt character vectorizer to the training characters
char_vectorizer.adapt(train_chars)

In [20]:
# get char vocab
char_vocab = char_vectorizer.get_vocabulary()

In [21]:
# char embedding layer
char_embed = tf.keras.layers.Embedding(input_dim=len(alphabet) + 2,
                                       output_dim=25,
                                       name='char_embed')

In [22]:
# check the distribution of line_number
int(np.percentile(train_df.line_number, 98))

15

In [23]:
# create line_number one-hot
train_line_numbers_oh = tf.one_hot(train_df['line_number'].to_numpy(), depth=15)
val_line_numbers_oh = tf.one_hot(val_df["line_number"].to_numpy(), depth=15)
test_line_numbers_oh = tf.one_hot(test_df["line_number"].to_numpy(), depth=15)

In [24]:
# check the distribution of total_lines
np.percentile(train_df.total_lines, 98)

20.0

In [25]:
# create total_lines one-hot
train_total_lines_oh = tf.one_hot(train_df['total_lines'].to_numpy(), depth=20)
val_total_lines_oh = tf.one_hot(val_df["total_lines"].to_numpy(), depth=20)
test_total_lines_oh = tf.one_hot(test_df["total_lines"].to_numpy(), depth=20)
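
One detail of the one-hot cells above is easy to miss: `tf.one_hot` returns an all-zero vector for any index outside `[0, depth)`, so a `line_number` of 15 or more carries no position signal with `depth=15`. The pure-Python sketch below mimics that behaviour for a single index and shows one possible alternative that folds long positions into the last slot. Both function names are hypothetical and this is not how the notebook encodes its features.

```python
def one_hot(index, depth):
    """Mimic tf.one_hot for one integer: out-of-range indices give all zeros."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def one_hot_clipped(index, depth):
    """Hypothetical alternative: fold every index >= depth into the last slot."""
    return one_hot(min(index, depth - 1), depth)


in_range = one_hot(14, 15)           # last valid position for depth=15
out_of_range = one_hot(20, 15)       # no slot is set
clipped = one_hot_clipped(20, 15)    # last slot is set instead
```

Whether clipping helps is an empirical question; the notebook keeps the plain `tf.one_hot` behaviour.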

## 1. Train `model_5` on all of the data in the training dataset until it stops improving. Since this might take a while, you might want to use:
- [`tf.keras.callbacks.ModelCheckpoint`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to save the model's best weights only.
- [`tf.keras.callbacks.EarlyStopping`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) to stop the model from training once the validation loss has stopped improving for ~3 epochs.
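
To make the interaction between the two callbacks concrete, here is a simplified pure-Python sketch of the patience rule: training stops after `patience` consecutive epochs without a new best validation loss, and the best epoch is the one a checkpoint with `save_best_only=True` would have kept. `simulate_early_stopping` is a hypothetical helper for illustration, not Keras's implementation, and the loss values are made up.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is at epoch 2
losses = [0.95, 0.93, 0.92, 0.925, 0.93, 0.94, 0.91]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3)
print(stop_epoch, best_epoch)
```

In this toy run, training stops at epoch 5 and never reaches the lower loss at epoch 6, which is the usual trade-off of a small patience value.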

In [26]:
# replicate model_5
# token input model
token_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
token_embeddings = tf_hub_embedding_layer(token_inputs)
token_outputs = tf.keras.layers.Dense(128, activation='relu')(token_embeddings)
token_model = tf.keras.Model(inputs=token_inputs, outputs=token_outputs)

# char input model
char_inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
char_vectors = char_vectorizer(char_inputs)
char_embeddings = char_embed(char_vectors)
char_bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(char_embeddings)
char_model = tf.keras.Model(inputs=char_inputs, outputs=char_bi_lstm)

# line numbers model
line_number_inputs = tf.keras.layers.Input(shape=(15,), dtype=tf.int32)
x = tf.keras.layers.Dense(32, activation='relu')(line_number_inputs)
line_number_model = tf.keras.Model(inputs=line_number_inputs, outputs=x)

# total lines model
total_lines_inputs = tf.keras.layers.Input(shape=(20,), dtype=tf.int32)
y = tf.keras.layers.Dense(32, activation='relu')(total_lines_inputs)
total_lines_model = tf.keras.Model(inputs=total_lines_inputs, outputs=y)

# token and char hybrid embedding
combined_embeddings = tf.keras.layers.Concatenate()([token_model.output, char_model.output])
z = tf.keras.layers.Dense(256, activation='relu')(combined_embeddings)
z = tf.keras.layers.Dropout(0.5)(z)

# concat combined embedding with line number and total lines models
z = tf.keras.layers.Concatenate()([line_number_model.output, total_lines_model.output, z])

# output layer
output_layer = tf.keras.layers.Dense(num_classes, activation='softmax')(z)

# model_5
model_5 = tf.keras.Model(inputs=[line_number_model.input,
                                 total_lines_model.input,
                                 token_model.input,
                                 char_model.input],
                         outputs=output_layer)

In [27]:
# model summary
model_5.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_2 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 input_1 (InputLayer)        [(None,)]                    0         []                            
                                                                                                  
 char_vectorizer (TextVecto  (None, 290)                  0         ['input_2[0][0]']             
 rization)                                                                                        
                                                                                                  
 universal_sentence_encoder  (None, 512)                  2567978   ['input_1[0][0]']       

In [28]:
# model callbacks
ckpt_path = 'model_5/model_5.ckpt'
mckpt = tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_path,
                                           save_best_only=True,
                                           save_weights_only=True)

es = tf.keras.callbacks.EarlyStopping(patience=3,
                                      restore_best_weights=True)

In [29]:
# compile
model_5.compile(optimizer='adam',
                loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2),
                metrics=['accuracy'])

In [30]:
# datasets
# train dataset
train_features = tf.data.Dataset.from_tensor_slices((train_line_numbers_oh,
                                                     train_total_lines_oh,
                                                     train_sentences,
                                                     train_chars))
train_labels = tf.data.Dataset.from_tensor_slices(train_oh)
train_ds = tf.data.Dataset.zip((train_features, train_labels))
train_ds = train_ds.batch(32).prefetch(tf.data.AUTOTUNE)

# validation dataset
val_features = tf.data.Dataset.from_tensor_slices((val_line_numbers_oh,
                                                   val_total_lines_oh,
                                                   val_sentences,
                                                   val_chars))
val_labels = tf.data.Dataset.from_tensor_slices(val_oh)
val_ds = tf.data.Dataset.zip((val_features, val_labels))
val_ds = val_ds.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
history_5 = model_5.fit(train_ds,
                        epochs=200,  # upper bound only; EarlyStopping ends training sooner
                        validation_data=val_ds,
                        validation_steps=int(len(val_ds) * 0.5),
                        callbacks=[mckpt, es])

Epoch 1/200
 569/5627 [==>...........................] - ETA: 22:11 - loss: 1.0987 - accuracy: 0.7253

## 2. Check out the [Keras guide on using pre-trained GloVe embeddings](https://keras.io/examples/nlp/pretrained_word_embeddings/). Can you get this working with one of our models?
- Hint: You'll want to incorporate it with a custom token [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.
- It's up to you whether or not you fine-tune the GloVe embeddings or leave them frozen.
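
One way to approach the hint is to build an embedding matrix whose row `i` holds the GloVe vector for the `i`-th word of the token vectorizer's vocabulary. The sketch below does this in plain Python on toy data. `load_glove_vectors`, `build_embedding_matrix`, the three-dimensional vectors, and the toy vocabulary are all illustrative assumptions; a real run would read a file such as `glove.6B.100d.txt` and use the vocabulary of a word-level `TextVectorization` layer.

```python
import io


def load_glove_vectors(lines):
    """Parse GloVe-format lines ("word v1 v2 ... vd") into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, embedding_dim):
    """Row i holds the vector for vocab[i]; words missing from GloVe stay all-zero."""
    matrix = [[0.0] * embedding_dim for _ in vocab]
    hits = 0
    for i, word in enumerate(vocab):
        vec = vectors.get(word)
        if vec is not None and len(vec) == embedding_dim:
            matrix[i] = vec
            hits += 1
    return matrix, hits


# Toy stand-in for a GloVe text file (assumption: 3-dimensional vectors)
toy_glove = io.StringIO("patients 0.1 0.2 0.3\nplacebo 0.4 0.5 0.6\n")
# Toy stand-in for a vectorizer vocabulary; index 0 is padding, index 1 is OOV
toy_vocab = ["", "[UNK]", "patients", "randomized", "placebo"]

glove = load_glove_vectors(toy_glove)
embedding_matrix, hits = build_embedding_matrix(toy_vocab, glove, embedding_dim=3)
print(f"Found vectors for {hits} of {len(toy_vocab)} vocabulary entries")
```

Following the Keras guide, the matrix could then be converted with `np.array` and passed to `tf.keras.layers.Embedding` through `embeddings_initializer=tf.keras.initializers.Constant(...)`, with `trainable=False` to keep the vectors frozen or `trainable=True` to fine-tune them.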

## 3. Try replacing the TensorFlow Hub Universal Sentence Encoder pre-trained embedding with the [TensorFlow Hub BERT PubMed expert](https://tfhub.dev/google/experts/bert/pubmed/2) (a language model pre-trained on PubMed texts) pre-trained embedding. Does this affect results?
- Note: Using the BERT PubMed expert pre-trained embedding requires an extra preprocessing step for sequences (as detailed in the [TensorFlow Hub guide](https://tfhub.dev/google/experts/bert/pubmed/2)).
- Does the BERT model beat the results mentioned in this paper? https://arxiv.org/pdf/1710.06071.pdf

## 4. What happens if you merge our `line_number` and `total_lines` features for each sequence? For example, create an `X_of_Y` feature instead? Does this affect model performance?
- Another example: `line_number=1` and `total_lines=11` turns into `line_of_X=1_of_11`.
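
A minimal sketch of the merged feature, assuming samples shaped like the dictionaries produced by `preprocess_text_with_line_numbers`. The helper names `add_line_of_total` and `one_hot_strings` and the `line_of_total` key are hypothetical. The combined string is one-hot encoded over the values seen in training, and unseen combinations map to an all-zero row.

```python
def add_line_of_total(samples):
    """Return copies of the sample dicts with a combined 'line_of_total' string."""
    merged = []
    for sample in samples:
        combined = dict(sample)
        combined["line_of_total"] = f"{sample['line_number']}_of_{sample['total_lines']}"
        merged.append(combined)
    return merged


def one_hot_strings(values, vocabulary=None):
    """One-hot encode strings; values missing from the vocabulary give an all-zero row."""
    if vocabulary is None:
        vocabulary = sorted(set(values))
    index = {value: i for i, value in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows, vocabulary


# Toy abstract with three lines (total_lines follows the notebook's len - 1 convention)
toy_samples = [
    {"line_number": 0, "total_lines": 2},
    {"line_number": 1, "total_lines": 2},
    {"line_number": 2, "total_lines": 2},
]
merged = add_line_of_total(toy_samples)
features = [s["line_of_total"] for s in merged]
encoded, vocab = one_hot_strings(features)
print(features)
```

With the data frames used above, the same string could be built as `train_df["line_number"].astype(str) + "_of_" + train_df["total_lines"].astype(str)`. The merged feature has many more distinct values than either original feature, so the one-hot input becomes wider and sparser.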

## 5. Write a function (or series of functions) to take a sample abstract string, preprocess it (in the same way our model has been trained), make a prediction on each sequence in the abstract, and return the abstract in the format:
- `PREDICTED_LABEL: SEQUENCE`
- `PREDICTED_LABEL: SEQUENCE`
- `PREDICTED_LABEL: SEQUENCE`
- `PREDICTED_LABEL: SEQUENCE`
- ...
- You can find your own unstructured RCT abstract from PubMed or try this one: [*Baclofen promotes alcohol abstinence in alcohol dependent cirrhotic patients with hepatitis C virus (HCV) infection*](https://pubmed.ncbi.nlm.nih.gov/22244707/).
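
The exercise can be broken into three steps: split the abstract into sentences, rebuild the per-sentence features used during training, and format the predictions. The sketch below covers those steps in plain Python under several assumptions. The regex splitter is a naive stand-in for a proper sentence segmenter, `predict_fn` is a hypothetical callable that maps the feature dictionaries to label strings, and `dummy_predict` is a placeholder so the example runs without a trained model.

```python
import re


def split_abstract(abstract):
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [s for s in sentences if s]


def make_model_inputs(sentences):
    """Rebuild the per-sentence features: position, total, lowercased text, chars."""
    total_lines = len(sentences) - 1  # same convention as preprocess_text_with_line_numbers
    return [
        {
            "line_number": i,
            "total_lines": total_lines,
            "text": sentence.lower(),
            "chars": " ".join(sentence.lower()),
        }
        for i, sentence in enumerate(sentences)
    ]


def label_abstract(abstract, predict_fn):
    """Return one 'PREDICTED_LABEL: SEQUENCE' string per sentence."""
    sentences = split_abstract(abstract)
    labels = predict_fn(make_model_inputs(sentences))
    return [f"{label}: {sentence}" for label, sentence in zip(labels, sentences)]


def dummy_predict(inputs):
    """Placeholder model: labels by position only, for demonstration."""
    labels = []
    for item in inputs:
        if item["line_number"] == 0:
            labels.append("BACKGROUND")
        elif item["line_number"] == item["total_lines"]:
            labels.append("CONCLUSIONS")
        else:
            labels.append("METHODS")
    return labels


toy_abstract = "This study tested a drug. Patients were randomized. The drug helped."
result = label_abstract(toy_abstract, dummy_predict)
for line in result:
    print(line)
```

A real `predict_fn` would one-hot encode `line_number` and `total_lines` with the same depths used in training (15 and 20), call `model_5.predict`, take the argmax of each row, and look the index up in `class_names`. The training text also has numbers replaced with `@`, so matching that preprocessing may matter for accuracy.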