## Data Import

Python libraries required: numpy, pandas and tensorflow

In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd

!pip install tensorflow-hub
import tensorflow_hub as hub
import tensorflow_datasets as tfds

#print("Version: ", tf.__version__)
#print("Eager mode: ", tf.executing_eagerly())
#print("Hub version: ", hub.__version__)
#print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")



Optional: If working in Google Colab, use drive.mount() so that you can import files from Google Drive into your code

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Initial data inspection
Load in test and training data files

In [2]:
train_file = '/content/drive/My Drive/drugLib_raw/drugLibTrain_raw.tsv'
test_file = '/content/drive/My Drive/drugLib_raw/drugLibTest_raw.tsv'

train_df = pd.read_csv(train_file,sep='\t')

This dataset contains 8 columns column describing drug information, patient condition, the resultant treatment, effectiveness and patient reviews. 

In this text classification exercise the aim is to predict the `effectiveness` from this dataset. 

There are 3 columns of descriptive patient review data: `benefitsReview`, `sideEffectsReview` and `commentsReview`. 

The data in these columns will be used for the text classification to predict the `effectiveness`.

This is a publically available dataset that can be found here along with a more comprehensive description of the data:
https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29


In [3]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview
0,2202,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,3117,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,1146,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,3947,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,1951,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above


## Data preprocessing

The data needs to be pre-processed before inputing it into a model.
A model will perform better if the input data consist of features that have a significant impact on what you are trying to predict and the amount of noise is minimized (data that is deemed insignificant). Several techniques are used here in this analysis exercise:

1.   Data cleansing
2.   Lemmatization
3.   Removal of stop words
4.   Using term frequency-inverse document frequency (TF-IDF)

### 1. Data cleansing

The train and test data is currently in tab delimited format and will be converted into Pandas Dataframes. 

An additional column `combinedReview` has been added which contains all the review data from the 3 columns, concatenated.

Another additional column `label` has been included in these Dataframes that assigns classification labels `effectiveness` as integer values so that they can be read later by the model.

The `combinedReview` data will be cleaned to remove any special (invalid) characters, multiple spaces, numbers, escape characters and any words that are just 1 character long. The review text will also be case-folded (converted all to lower case) so that words that are spelled the same will be grouped together regardless of their upper/lower casing.

### 2. Lemmatization

Lemmatization takes words and reduces them down to its base form i.e. its lemma. This helps to group together words that are similar and can be considered equivalent for the purposes of input features for a model. For example, the words `bird` and `birds` have the same lemma, `bird`, so they will be grouped together.

### 3. Removal of stop words
Stop words, or function words, are words that typically do not provide a lot of information for text mining and are not useful to include in the model. Stop words are the most commonly used words, e.g. this, the, a, of. 

In [8]:
labels_dict = {}
for count,label in enumerate(train_df["effectiveness"].unique()):
  labels_dict[label] = count+1

def tsv2df(filename):
  df = pd.read_csv(filename,sep='\t')
  df["combinedReview"] = np.nan
  df["label"] = np.nan
  for row in df.itertuples():
    drug_review = ""
    benefitsReview = df.loc[row.Index,["benefitsReview"]].values[0]
    sideEffectsReview = df.loc[row.Index,["sideEffectsReview"]].values[0]
    commentsReview = df.loc[row.Index,["commentsReview"]].values[0]
    # concatenate review data from all 3 columns into a new column
    if not pd.isnull(benefitsReview):
      drug_review += ' '.join(clean_review_text(benefitsReview))
    if not pd.isnull(sideEffectsReview):
      drug_review += ' ' + ' '.join(clean_review_text(sideEffectsReview))
    if not pd.isnull(commentsReview):
      drug_review += ' ' + ' '.join(clean_review_text(commentsReview))
    if drug_review.strip() == "":
      continue
    df.loc[row.Index,["combinedReview"]] = drug_review
    # Use integers to define classification labels
    df.loc[row.Index,["label"]] = labels_dict[train_df.loc[row.Index,["effectiveness"]].values[0]]
  return df

import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Function used to clean the test and training drug review data and returns a tokenized list of the words in their lemma form(root form)
def clean_review_text(drug_review):
    # Remove special characters
    drug_review = re.sub('\W', ' ', drug_review)
    # Remove underscores
    drug_review = re.sub('_', '', drug_review)
    # Remove single characters
    drug_review = re.sub(r'\s+[a-zA-Z]\s+', ' ', drug_review)
    # Remove all numbers
    drug_review = re.sub("\d+", "", drug_review)
    # Remove single characters from the start
    drug_review = re.sub(r'\^[a-zA-Z]\s+', ' ', drug_review)
    # Substituting multiple spaces with single space
    drug_review = re.sub(r'\s+', ' ', drug_review, flags=re.I)
    # Removing prefixed 'b'
    drug_review = re.sub(r'^b\s+', '', drug_review)
    # Remove any newline escape characters
    drug_review = re.sub("\n", "", drug_review)
    # Converting to Lowercase
    drug_review = drug_review.lower()
    # Lemmatization
    drug_review = drug_review.split()
    # Replace each word in each text file with its lemma form (root form)
    drug_review = [WordNetLemmatizer().lemmatize(word) for word in drug_review]
    # Remove stop words
    stops = stopwords.words('english')
    clean_drug_review = []

    postags = nltk.pos_tag(drug_review)
    for tag in postags:
        word = tag[0]
        pos = tag[1]
        if pos[:2] == "NN" or pos[0] == "V" or pos[:2] == "RB" or pos[0] == "J":
            if not word in stops and len(word) > 2:
                clean_drug_review.append(word)
    return clean_drug_review


train_df = tsv2df(train_file)
test_df = tsv2df(test_file)

# Remove any duplicate lines in the train dataset and also any rows with no patient review data (no input for the model)
train_df.drop_duplicates()
train_df.dropna(subset=["combinedReview"],inplace=True)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Viewing the updated, cleaned Dataframe for the training dataset which includes the `combinedReview` column and the classification labels from the `effectiveness` mapped to integers in the `label` column

In [11]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview,combinedReview,label
0,2202,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ...",slowed progression left ventricular dysfunctio...,1.0
1,3117,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest...",type birth control con help cramp also effecti...,1.0
2,1146,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...,used cramp badly leave balled bed least day po...,1.0
3,3947,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...,acid reflux went away month day drug heartburn...,2.0
4,1951,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above,think lyrica starting help pain side effect se...,2.0


### 4. Using term frequency-inverse document frequency (TF-IDF)
The reviews contain a very large range of words and it would take a lot of computational power to include every single unique word when training a model. We can use term frequency-inverse document frequency (TF-IDF) which is a statistic that measures how important words are to a corpus (the set of all reviews). TF-IDF depends on how many times a word appears in a single document and also how many different documents that word appears in.

Higher TF-IDF values are given to words that appear  frequently within a single document *AND* if that word appears in a smaller number of documents. This means that if these words are seen in a document, it will be easier to predict what classification that document belongs to, based on that word. Using intuition, a word like `the` would have a low TF-IDF because it would probably appear in all documents. A word like `headache` may have a high TF-IDF because it seems like a word that would appear in a smaller number of reviews, and intuitively it would seem that this word highly suggests that the drug review is negative, for example. A word that is completely unique and only appears once in all the reviews would have a low TF-IDF because it appears infrequently in a single document (an example of this could be a misspelled word, like `headahce`.

We will also allow for up to groups of 2 words and 3 words (called bi-grams and tri-grams) to be included for this analysis. In this example, `side effect` is a bi-gram and results in a high TF-IDF score of 54.122304.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(ngram_range = (1,3),
                     sublinear_tf = True,
                     max_features = 40000)
train_tv = tv.fit_transform(train_df['combinedReview'])
vocab = tv.get_feature_names()
dist = np.sum(train_tv, axis=0)
vocab_tfidf_df = pd.DataFrame(dist,columns = vocab)
sorted_tfidf_df = vocab_tfidf_df.sort_values(vocab_tfidf_df.first_valid_index(),axis=1,ascending=False)
print("TF-IDF scores for terms sorted by the highest score from left to right")
sorted_tfidf_df.head()

TF-IDF scores for terms sorted by the highest score from left to right


Unnamed: 0,day,effect,side,take,side effect,taking,drug,time,pain,week,medication,treatment,year,month,also,took,pill,skin,none,first,night,get,sleep,daily,feel,hour,started,doctor,symptom,felt,dose,work,depression,taken,better,much,severe,prescribed,still,morning,...,really never,rage time minor,rage medication,really mad thing,quit warfarin several,much blood,really doctor,realized headache able,realized headache,really doctor tell,really feel depressed,quit scenario,quit scenario repeated,much blood give,quit warfarin started,substance resulting,substance resulting small,wet substance resulting,reason discontinued use,quit taking altogether,reason forgot,much fun reason,quite sure thing,much done day,nearly healthy,nearly healthy mentally,realize severe,reason stopped taking,reason stopped,realize pulling,realize pulling even,much coffee made,realize severe anxiety,really weird way,much coffee,really well taking,reason forgot take,reason discontinued,mental focus submitting,mental focus
0,71.974762,62.70226,57.838698,57.114266,55.32922,52.28423,50.247639,50.216333,49.028411,46.310573,45.557485,41.870218,40.678125,40.411878,39.897685,38.878173,38.628507,37.770989,35.598845,35.582818,34.347453,33.321561,32.97327,32.601669,32.003928,31.67047,31.517658,31.439981,31.225123,30.099893,29.675387,29.598504,29.485145,29.385297,28.892024,28.54432,28.525362,28.294834,27.998296,27.925632,...,0.061973,0.061973,0.061973,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.061058,0.059557,0.059557,0.059557,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.059015,0.055737,0.055737


Create the lexicon by calculating Term Frequency - Infrequent Document Frequency (TF-IDF). We will save the terms with the top 600 TF-IDF scores as the vocabulary chosen as input features for this model.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_lexicon(train_df,size=400):
    tv = TfidfVectorizer(ngram_range = (1,3),
                         sublinear_tf = True,
                         max_features = 40000)
    train_tv = tv.fit_transform(train_df['combinedReview'])
    vocab = tv.get_feature_names()
    dist = np.sum(train_tv, axis=0)
    vocab_tfidf_df = pd.DataFrame(dist,columns = vocab)
    sorted_tfidf_df = vocab_tfidf_df.sort_values(vocab_tfidf_df.first_valid_index(),axis=1,ascending=False)
    i = 0
    target_vocab = []
    for (word, score) in sorted_tfidf_df.iteritems():
        target_vocab.append((word, score.values[0]))
        i += 1
        if i == size:
            break
    return target_vocab

size = 600
lexicon_tfidf = tfidf_lexicon(train_df, size)
lexicon = [l[0] for l in lexicon_tfidf]

print("Target vocabulary to use for text classification are the words with the top "+ str(size) + " TF-IDF scores:")
print(', '.join(term for term in lexicon))

Target vocabulary to use for text classification are the words with the top 600 TF-IDF scores:
day, effect, side, take, side effect, taking, drug, time, pain, week, medication, treatment, year, month, also, took, pill, skin, none, first, night, get, sleep, daily, feel, hour, started, doctor, symptom, felt, dose, work, depression, taken, better, much, severe, prescribed, still, morning, back, use, able, help, feeling, headache, anxiety, acne, problem, benefit, well, weight, blood, stopped, dry, tablet, helped, even, used, increased, effective, went, reduced, life, stomach, dosage, mild, loss, made, infection, experienced, really, due, period, good, long, face, however, never, little, control, nausea, result, mood, level, noticed, normal, patient, tried, bad, mouth, pressure, got, stop, needed, medicine, thing, body, using, twice, worked, make, became, eye, hair, experience, high, need, almost, see, away, think, caused, migraine, found, completely, longer, several, began, know, food, goi

Building the classifications which maps distinct classes from the dataset to integers

In [20]:
def build_classifications(df):
    classification_names = list(df.effectiveness.unique())
    classifications = {}
    id = 1
    for classification in classification_names:
        classifications[classification] = id
        id += 1
    return classifications

classifications = build_classifications(train_df)
print(classifications)

{'Highly Effective': 1, 'Marginally Effective': 2, 'Ineffective': 3, 'Considerably Effective': 4, 'Moderately Effective': 5}


Building the final feature sets to use for the Random Forest model. The final feature set is converted into a vector representing the frequency of words from the targeted vocabulary that are present in each review. Since the vocabulary is 600 words long, the vector representation for each review is also 600 numbers long. Taking the target vocabulary that is printed in the code block above, the first 3 vocab words are day, effect and side.
If the review text is `I experience this side effect on a day to day basis` then the vector representation would be [2,1,2] because `day` appears twice and `side` and `effect` appear once only.

*   `X_train` contains the vector representation of the reviews from the training dataset. The model will use this dataset to train.
*   `y_train` represents the classification for that review. Each review is represented by one number. For example, if the `effectiveness` of a drug review is `Highly Effective`, then this is represented in the `y_train` dataset as `1`. There are 3097 reviews in the final training dataset, so `y_train` would simply be a list of 3097 numbers, ranging from 1 to 5 for the different classes.
The model will use this dataset to train.

*   `X_test` contains the vector representation of the reviews from the test dataset. After the model is trained, it will use this as input features to try and predict the effectiveness.
*   `y_test` contains the actual effectiveness for the reviews in the test dataset. After the model has predicted the classifications, its guesses will be compared against this dataset to measure its accuracy, precision and recall later.





In [24]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def build_featureset(df,lexicon,classifications):
    featureset = []

    # N-gram terms from the lexicon kept as a separate list
    ngrams = [ngram for ngram in lexicon if (len(ngram.split()) > 1)]

    for row in df.itertuples():
        combined_drug_review = df.loc[row.Index, "combinedReview"]
        # If there is no review available for a row, no need to process it, continue on processing the next line
        if pd.isnull(combined_drug_review):
            continue
        word_tokens = word_tokenize(combined_drug_review.lower())
        features = np.zeros(len(lexicon))
        classification = classifications[row.effectiveness]
        for i in range(len(word_tokens)):
            # Check match to n-grams first
            for ngram in ngrams:
                if ngram in combined_drug_review:
                    index_value = lexicon.index(ngram.lower())
                    features[index_value] += 1

            # Matching one word terms
            if word_tokens[i].lower() in lexicon:
                index_value = lexicon.index(word_tokens[i].lower())
                features[index_value] += 1

            features = list(features)
            featureset.append([features, classification])
    return featureset


train_featureset = build_featureset(train_df,lexicon,classifications)
test_featureset = build_featureset(test_df,lexicon,classifications)

X_train = [features[0] for features in train_featureset]
y_train = [features[1] for features in train_featureset]
X_test = [features[0] for features in test_featureset]
y_test = [features[1] for features in test_featureset]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Building the model



In [28]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

#n_estimators defines the number of trees for the random forest
regressor = RandomForestRegressor(n_estimators=1,random_state=0,verbose=2)
regressor.fit(X_train, y_train)

# # The random forest model returns a matrix of floats, this needs to be transformed into integers
# # to define the clear cut classifications so that accuracy can be calculated.
# y_pred_float = regressor.predict(X_test)
# y_pred = []
# for y in y_pred_float:
#    y_pred.append(round(y))



cv = StratifiedKFold(n_splits=5, random_state=123, shuffle=True)
results = pd.DataFrame(columns=['training_score', 'test_score'])
fprs, tprs, scores = [], [], []
    
for (train, test), i in zip(cv.split(X, y), range(5)):
    regressor.fit(X.iloc[train], y.iloc[train])
    _, _, auc_score_train = compute_roc_auc(train)
    fpr, tpr, auc_score = compute_roc_auc(test)
    scores.append((auc_score_train, auc_score))
    fprs.append(fpr)
    tprs.append(tpr)


print("\nRandom Forest Accuracy:", metrics.accuracy_score(y_test, y_pred))

print("Random forest confusion matrix: \n")
print(metrics.confusion_matrix(y_test,y_pred))

print("\n Random forest precision: " )

print("\nRandom forest precision score:")
print(metrics.precision_score(y_test,y_pred,average='weighted'))
print("\nRandom forest recall score:")
print(metrics.recall_score(y_test,y_pred,average='weighted'))
print("\nRandom forest f1 score:")
print(metrics.f1_score(y_test,y_pred,average='weighted'))

print("classification report:")
print(metrics.classification_report(y_test, y_pred))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 1


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.0s finished


NameError: ignored

Convert train and test dataframes into Tensorflow datasets. Split the full train dataset into validation and test sets

In [None]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('label')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

full_train_tfds = df_to_dataset(train_df)
test_tfds = df_to_dataset(test_df)

full_train_tfds.shuffle(32)
def is_val(x, y):
    return x % 4 == 0

def is_train(x, y):
    return not is_val(x, y)

recover = lambda x,y: y

val_tfds = full_train_tfds.enumerate() \
                    .filter(is_val) \
                    .map(recover)

train_tfds = full_train_tfds.enumerate() \
                    .filter(is_train) \
                    .map(recover)

ValueError: ignored

Optional: Run the lines of code below you would like to view examples of the features and labels created in the Tensorflow Dataset for one batch of data

In [None]:
for feature_batch, label_batch in train_tfds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of combinedReviews:', feature_batch['combinedReview'])
  print('A batch of targets:', label_batch )

NameError: ignored

## Build the model

Creating a Keras layer using a pre-trained model from TensorFlow Hub to convert the `combined reviews` into embeddings. The embedding converts each `combined review` into a 20 dimension array (despite the length and contents of the review). An example of a sentence embedding is printed here.

In [None]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
a = hub_layer(feature_batch['combinedReview'])
# Prints one 20 dimensional embedding array 
a[0]














<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([ 3.9738226, -2.4324079,  1.7600758,  1.7758725, -4.506548 ,
       -4.833144 , -3.5563126,  4.78504  ,  4.196502 , -0.4463017,
       -3.0088549,  4.280519 ,  0.5396668,  0.6327668, -6.863371 ,
        2.5921571,  5.555472 , -3.1780574, -3.8864355, -2.6467786],
      dtype=float32)>

In [None]:
label_batch

<tf.Tensor: shape=(32,), dtype=float64, numpy=
array([3., 2., 4., 1., 3., 3., 2., 1., 5., 5., 1., 1., 2., 5., 3., 3., 1.,
       3., 3., 2., 5., 3., 3., 5., 5., 1., 5., 1., 3., 2., 1., 3.])>

Build the model using a pre-trained model from Tensorflow. This model contains text embeddings which is trained on English Google News (130GB corpus). 


https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1

After the model is built a summary is printed.

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary() 

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_19 (KerasLayer)  (None, 20)                400020    
_________________________________________________________________
dense_12 (Dense)             (None, 16)                336       
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________


Building a loss function and optimizer for training

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Train the model

Training the model over 20 epochs in batches of 32 using the train and validation datasets. The model's loss and accuracy will be monitored over 10,000 samples from the validation set.

In [None]:
hub_layer.input_shape

In [None]:
#history = model.fit(train_tfds.shuffle(10000).batch(512),epochs=20,validation_data=val_tfds.batch(32), verbose=1)

history = model.fit(train_tfds.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=val_tfds.batch(512),
                    verbose=1)

Epoch 1/20


  [n for n in tensors.keys() if n not in ref_input_names])


ValueError: ignored

## Evaluate the model

Evaluate the model performance on the test dataset.

## Acknowledgements

This code was written using the Tensorflow tutorial documentation as a guide