## Data Import

Python libraries required: numpy, pandas, tensorflow, tensorflow-hub and tensorflow-datasets

In [None]:
import numpy as np
import tensorflow as tf
import pandas as pd

!pip install tensorflow-hub
import tensorflow_hub as hub
import tensorflow_datasets as tfds

#print("Version: ", tf.__version__)
#print("Eager mode: ", tf.executing_eagerly())
#print("Hub version: ", hub.__version__)
#print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

If working in Google Colab, use `drive.mount()` to make files stored in Google Drive accessible.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Data preprocessing
Load the training data file.

In [8]:
train_file = '/content/drive/My Drive/drugLib_raw/drugLibTrain_raw.tsv'


train_df = pd.read_csv(train_file,sep='\t')

## Initial data inspection

This dataset contains 8 columns describing the drug name, the patient's rating, the treatment's effectiveness and side effects, the patient's condition, and three free-text patient reviews.

In this text classification exercise the aim is to predict the `effectiveness` from this dataset. 

There are 3 columns of descriptive patient review data: `benefitsReview`, `sideEffectsReview` and `commentsReview`. 

The data in these columns will be used for the text classification to predict the `effectiveness`.

This is a publicly available dataset. It can be found here, along with a more comprehensive description of the data:
https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29


In [193]:
train_df.head()

| Unnamed: 0.1 | Unnamed: 0 | urlDrugName | rating | effectiveness | sideEffects | condition | benefitsReview | sideEffectsReview | commentsReview | combinedReview | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1366 | biaxin | 9 | Considerably Effective | Mild Side Effects | sinus infection | The antibiotic may have destroyed bacteria cau... | Some back pain, some nauseau. | Took the antibiotics for 14 days. Sinus infect... | The antibiotic may have destroyed bacteria cau... | 1.0 |
| 1 | 3724 | lamictal | 9 | Highly Effective | Mild Side Effects | bipolar disorder | Lamictal stabilized my serious mood swings. On... | Drowsiness, a bit of mental numbness. If you t... | Severe mood swings between hypomania and depre... | Lamictal stabilized my serious mood swings. On... | 2.0 |
| 2 | 3824 | depakene | 4 | Moderately Effective | Severe Side Effects | bipolar disorder | Initial benefits were comparable to the brand ... | Depakene has a very thin coating, which caused... | Depakote was prescribed to me by a Kaiser psyc... | Initial benefits were comparable to the brand ... | 3.0 |
| 3 | 969 | sarafem | 10 | Highly Effective | No Side Effects | bi-polar / anxiety | It controlls my mood swings. It helps me think... | I didnt really notice any side effects. | This drug may not be for everyone but its wond... | It controlls my mood swings. It helps me think... | 2.0 |
| 4 | 696 | accutane | 10 | Highly Effective | Mild Side Effects | nodular acne | Within one week of treatment superficial acne ... | Side effects included moderate to severe dry s... | Drug was taken in gelatin tablet at 0.5 mg per... | Within one week of treatment superficial acne ... | 2.0 |


The train and test data are stored in tab-delimited format and are read into pandas DataFrames.

An additional column, `combinedReview`, is added. It contains the review text from the 3 review columns, concatenated.

A second additional column, `label`, maps each `effectiveness` class to an integer value so that it can be read later by the model.

In [142]:
test_file = '/content/drive/My Drive/drugLib_raw/drugLibTest_raw.tsv'
labels_dict = {}
for count,label in enumerate(train_df["effectiveness"].unique()):
  labels_dict[label] = count+1

def tsv2df(filename):
  df = pd.read_csv(filename,sep='\t')
  df["combinedReview"] = np.nan
  df["label"] = np.nan
  for row in df.itertuples():
    drug_review = ""
    benefitsReview = df.loc[row.Index,["benefitsReview"]].values[0]
    sideEffectsReview = df.loc[row.Index,["sideEffectsReview"]].values[0]
    commentsReview = df.loc[row.Index,["commentsReview"]].values[0]
    # concatenate review data from all 3 columns into a new column
    reviews = [benefitsReview, sideEffectsReview,commentsReview]
    for review in reviews:
      if pd.isnull(review):
        continue
      drug_review += review +" "
    if drug_review.strip() == "":
      continue
    df.loc[row.Index,["combinedReview"]] = drug_review
    # Use integers to define classification labels (look up the label in df itself, not train_df)
    df.loc[row.Index,["label"]] = labels_dict[df.loc[row.Index,["effectiveness"]].values[0]]
  return df

train_df = tsv2df(train_file)
test_df = tsv2df(test_file)
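
The row-by-row loop above can also be expressed with vectorized pandas operations. The following is a minimal sketch under stated assumptions: `toy_df` and its contents are hypothetical stand-ins for the real TSV data, and unlike the loop, the joined text has no trailing space.

```python
import pandas as pd

# Hypothetical rows standing in for the real TSV contents
toy_df = pd.DataFrame({
    "effectiveness": ["Highly Effective", "Ineffective", "Highly Effective"],
    "benefitsReview": ["Cleared the infection.", None, "Stable mood."],
    "sideEffectsReview": ["Mild nausea.", None, None],
    "commentsReview": ["Took it for 14 days.", None, "Works for me."],
})

review_cols = ["benefitsReview", "sideEffectsReview", "commentsReview"]

# Map each distinct effectiveness string to an integer, starting at 1 as in the loop above
labels_dict = {name: i + 1 for i, name in enumerate(toy_df["effectiveness"].unique())}

# Join the non-missing reviews of each row with a single space
combined = toy_df[review_cols].apply(
    lambda row: " ".join(row.dropna().astype(str)), axis=1
)

# Rows with no review text at all keep a missing combinedReview and a missing label
has_text = combined.str.strip() != ""
toy_df["combinedReview"] = combined.where(has_text)
toy_df["label"] = toy_df["effectiveness"].map(labels_dict).where(has_text)

print(toy_df[["combinedReview", "label"]])
```

Building the mapping once from the training data and reusing it for the test data keeps the integer labels consistent between the two DataFrames.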

Convert the train and test DataFrames into TensorFlow datasets. Split the full train dataset into training and validation sets.

In [190]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('label')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

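# Keep the batch order fixed (shuffle=False) so that the index-based split below
# assigns the same batches to training and validation on every pass
full_train_tfds = df_to_dataset(train_df, shuffle=False)
test_tfds = df_to_dataset(test_df)
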
def is_val(x, y):
    return x % 4 == 0

def is_train(x, y):
    return not is_val(x, y)

recover = lambda x,y: y

val_tfds = full_train_tfds.enumerate() \
                    .filter(is_val) \
                    .map(recover)

train_tfds = full_train_tfds.enumerate() \
                    .filter(is_train) \
                    .map(recover)
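
The `enumerate`/`filter`/`map` chain above sends every fourth batch to validation and the rest to training. A plain-Python sketch of the same idea, using hypothetical integers in place of batches:

```python
# Hypothetical stand-ins for batches: the integers 0..9 play the role of batches
batches = list(range(10))

def is_val(index, batch):
    # Every 4th batch (index 0, 4, 8, ...) goes to validation
    return index % 4 == 0

val_batches = [b for i, b in enumerate(batches) if is_val(i, b)]
train_batches = [b for i, b in enumerate(batches) if not is_val(i, b)]

print(val_batches)    # [0, 4, 8]
print(train_batches)  # [1, 2, 3, 5, 6, 7, 9]
```

This kind of split relies on the batches arriving in the same order each time the dataset is iterated. A dataset that reshuffles on every pass would place different batches in each split from one epoch to the next.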

Optional: run the code below to view examples of the features and labels in the TensorFlow dataset for one batch of data.

In [None]:
for feature_batch, label_batch in train_tfds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of combinedReviews:', feature_batch['combinedReview'])
  print('A batch of targets:', label_batch )

## Build the model

Create a Keras layer from a pre-trained TensorFlow Hub model to convert each `combinedReview` into an embedding. The layer maps each review to a 20-dimensional vector, regardless of the review's length or contents. An example of a review embedding is printed here.

In [201]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
a = hub_layer(feature_batch['combinedReview'])
# Prints one 20 dimensional embedding array 
a[0]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([ 3.9738226, -2.4324079,  1.7600758,  1.7758725, -4.506548 ,
       -4.833144 , -3.5563126,  4.78504  ,  4.196502 , -0.4463017,
       -3.0088549,  4.280519 ,  0.5396668,  0.6327668, -6.863371 ,
        2.5921571,  5.555472 , -3.1780574, -3.8864355, -2.6467786],
      dtype=float32)>
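
To illustrate why the output size does not depend on the review length, here is a toy sketch. It is not the actual Hub model: the vocabulary, the random vectors and the `embed` helper are all hypothetical, and averaging is just one simple way to pool per-token vectors into a fixed-size result.

```python
import numpy as np

EMBED_DIM = 20
rng = np.random.default_rng(0)

# Hypothetical vocabulary with random vectors standing in for trained embeddings
vocab = {
    word: rng.normal(size=EMBED_DIM)
    for word in ["mild", "nausea", "mood", "swings", "stabilized"]
}

def embed(text):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [vocab[token] for token in text.lower().split() if token in vocab]
    if not vectors:
        return np.zeros(EMBED_DIM)
    return np.mean(vectors, axis=0)

short_vec = embed("mild nausea")
long_vec = embed("mood swings stabilized after mild nausea in the first week")
print(short_vec.shape, long_vec.shape)
```

Both reviews produce a vector of the same shape, which is what allows a fixed-size `Dense` layer to follow the embedding layer.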

Build the model using the pre-trained TensorFlow Hub layer, whose text embeddings were trained on an English Google News corpus (130GB):

https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1

After the model is built a summary is printed.

In [184]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary() 

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_14 (KerasLayer)  (None, 20)                400020    
_________________________________________________________________
dense_4 (Dense)              (None, 16)                336       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________


Configure the loss function and optimizer for training.

In [185]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
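
The `effectiveness` column takes more than two values, so a binary loss on a single output unit may not match the task. As a point of comparison, here is a minimal NumPy sketch of the sparse categorical cross-entropy that a multi-class output would use. The five classes indexed 0 to 4 are an assumption, and the function below is an illustration rather than the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, labels):
    """Mean cross-entropy between integer class labels and unnormalized logits."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row-wise max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each example
    return -log_probs[np.arange(len(labels)), labels].mean()

NUM_CLASSES = 5  # assumption: five effectiveness levels, indexed 0..4

# All-zero logits give a uniform prediction, so the loss equals log(NUM_CLASSES)
uniform_logits = np.zeros((3, NUM_CLASSES))
loss = sparse_categorical_crossentropy(uniform_logits, [0, 2, 4])
print(loss)
```

In Keras, the corresponding setup would be a final `Dense(NUM_CLASSES)` layer with `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`. That loss expects labels running from 0 to `NUM_CLASSES - 1`, whereas `labels_dict` above starts at 1.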

## Train the model

Train the model for 20 epochs in batches of 32 using the training and validation datasets. The model's loss and accuracy are monitored on the validation set after each epoch.

In [None]:
# Datasets are already batched by df_to_dataset; pass only the review text to the model
to_text = lambda features, label: (features['combinedReview'], label)
history = model.fit(train_tfds.map(to_text), epochs=20,
                    validation_data=val_tfds.map(to_text), verbose=1)

## Evaluate the model

Evaluate the model performance on the test dataset.
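
In Keras, `model.evaluate` reports the loss and metrics set in `model.compile`. The sketch below uses hypothetical labels and predictions rather than output from this model. It shows how accuracy and a confusion matrix can be computed once predicted classes are available.

```python
import numpy as np

# Hypothetical true labels and predicted classes for eight test reviews
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Accuracy is the fraction of predictions that match the true label
accuracy = (y_true == y_pred).mean()

# confusion[i, j] counts examples of true class i predicted as class j
num_classes = 3
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

print(accuracy)
print(confusion)
```

A confusion matrix shows which classes are mistaken for each other, which a single accuracy figure hides when some classes are much more common than others.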

## Acknowledgements

This code was written using the TensorFlow tutorial documentation as a guide.