In [1]:
# import packages
import pandas as pd
import boto3

# based on this tutorial:
# https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

# it uses tf.keras, a high-level API to build and train models in TensorFlow,
# and TensorFlow Hub, a library and platform for transfer learning

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
from sklearn.model_selection import train_test_split

import tensorflow as tf

!pip install -q tensorflow-hub
!pip install -q tensorflow-datasets
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.0.0
Eager mode:  True
Hub version:  0.7.0
GPU is NOT AVAILABLE


In [42]:
# load data 
data = pd.read_csv("data/train.csv")
data = data[['target', 'question_text']]
target = data.pop("target")

# train test split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)

# alternative to consider: k-fold cross-validation instead of a static train/test split
# https://stackoverflow.com/questions/39748660/how-to-perform-k-fold-cross-validation-with-tensorflow


# load to tf.data --> train
# X_train is a one-column DataFrame, so each example has shape (1,)
train_data = tf.data.Dataset.from_tensor_slices((X_train.values, y_train.values))

# load to tf.data --> validation
validation_data = tf.data.Dataset.from_tensor_slices((X_test.values, y_test.values))
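
The cross-validation note in the cell above could be sketched roughly as follows. This is a minimal sketch that assumes a plain NumPy index split; the helper `kfold_indices` and the toy `texts` and `labels` arrays are hypothetical stand-ins, not part of the notebook's data.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    # array_split tolerates n_samples not being divisible by n_splits
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# toy stand-ins for the question_text and target columns (hypothetical data)
texts = np.array([f"question {i}" for i in range(10)])
labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

splits = list(kfold_indices(len(texts), n_splits=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(fold, len(train_idx), len(val_idx))
```

Each index pair could then be passed to `tf.data.Dataset.from_tensor_slices((texts[train_idx], labels[train_idx]))` to build one training set per fold, with a fresh model trained on each.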

In [50]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))

# flatten the batch from shape (10, 1) to (10,)
train_examples_batch = tf.reshape(train_examples_batch, [10,])

# inspect the batch --> note the shape
train_examples_batch

<tf.Tensor: id=1071, shape=(10,), dtype=string, numpy=
array([b'What are the differences between the Indian work culture and the UK work culture?',
       b'Is this English sentence correct and not weird?',
       b'How much of a "leg up" would the Cornell human ecology be for a student? Worth the amount of money it would cost?',
       b'Will Trump get away with all of his lies and blatantly corrupt behavior?',
       b'What is the meaning of freedom of speech for RSS & BJP person?',
       b'What is the altitude of Genting Highlands?',
       b'Is GPA useful for getting better universities through GRE?',
       b'Can you compare Pakistani with pigs?',
       b'How pashmina shawl is formed?', b'How is murder morally wrong?'],
      dtype=object)>
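
The reshape step can be illustrated with NumPy alone. This is a small sketch under the assumption that the examples come from a one-column table, so the underlying array is two-dimensional; the `examples` array is made up for illustration.

```python
import numpy as np

# hypothetical stand-in for X_train.values: one column, so the array is 2-D
examples = np.array([["first question?"], ["second question?"], ["third question?"]])
print(examples.shape)  # (3, 1)

# collapse the trailing axis so each example is a single string
flat = examples.reshape(-1)
print(flat.shape)  # (3,)
```

Using `-1` avoids hard-coding the batch size; `tf.reshape(x, [-1])` behaves the same way on a tensor.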

In [51]:
# labels
train_labels_batch

<tf.Tensor: id=1069, shape=(10,), dtype=int64, numpy=array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])>

Build the model

There are three main architecture decisions:
1) how to represent the data (the text)
2) how many layers to use in the model 
3) how many **hidden** units to use for each layer 

Transfer Learning 
One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which has three advantages:

1) we don't have to worry about text preprocessing,
2) we can benefit from transfer learning,
3) the embedding has a fixed size, so it's simpler to process.

https://blog.fastforwardlabs.com/2019/09/05/transfer-learning-from-the-ground-up.html






For this example we will use a pre-trained text embedding model from TensorFlow Hub called google/tf2-preview/gnews-swivel-20dim/1.

There are three other pre-trained models to test for the sake of this tutorial:

1) google/tf2-preview/gnews-swivel-20dim-with-oov/1 - the same as google/tf2-preview/gnews-swivel-20dim/1, but with 2.5% of the vocabulary converted to OOV buckets. This can help if the vocabulary of the task and the vocabulary of the model don't fully overlap.

2) google/tf2-preview/nnlm-en-dim50/1 - a much larger model with ~1M vocabulary size and 50 dimensions.

3) google/tf2-preview/nnlm-en-dim128/1 - an even larger model with ~1M vocabulary size and 128 dimensions.

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. 

Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension)
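
To see why the output shape does not depend on sentence length, here is a toy sketch. It assumes a tiny made-up vocabulary, a random embedding table, and mean pooling as the combiner; none of these are taken from the TensorFlow Hub model, whose internals may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy vocabulary; unknown tokens share one extra OOV row
vocab = {"what": 0, "is": 1, "the": 2, "altitude": 3, "of": 4}
oov_index = len(vocab)
embedding_dim = 20
table = rng.normal(size=(len(vocab) + 1, embedding_dim))

def embed(sentence):
    """Look up each token and average the vectors (assumed combiner)."""
    tokens = sentence.lower().split()
    ids = [vocab.get(t, oov_index) for t in tokens]
    return table[ids].mean(axis=0)

batch = [
    "What is the altitude",
    "Is the altitude of Genting Highlands what the altitude is",
]
embedded = np.stack([embed(s) for s in batch])
print(embedded.shape)  # (2, 20)
```

A four-word and a ten-word sentence both collapse to one 20-dimensional vector, which is what lets the dense layers that follow use a fixed input size.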

In [52]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)

In [57]:
# open question: is the sentence vector the average of the token embeddings?
# example --> 'What are the differences between the Indian work culture and the UK work culture?'
hub_layer(train_examples_batch[:1])


<tf.Tensor: id=1340, shape=(1, 20), dtype=float32, numpy=
array([[ 1.3658514 ,  1.2020671 ,  2.1172225 , -1.238663  , -0.13164836,
        -0.79073864,  0.85393584, -0.27051654, -0.028162  , -1.0663986 ,
         0.68132824, -1.1612318 , -1.7176561 ,  0.5895182 , -1.7590446 ,
        -0.3582189 ,  1.6168668 ,  0.21547578, -0.8037966 , -0.29350185]],
      dtype=float32)>

In [55]:
# building full model 

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_4 (KerasLayer)   (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
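
The parameter counts in the summary can be checked by hand. The sketch below rebuilds the two dense layers in NumPy with random weights (an assumption for illustration, not the trained model) and runs a forward pass on fake 20-dimensional embeddings. The reading of 400,020 as 20,001 rows of 20 values is an inference from the arithmetic, not something the summary states.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_units = 20, 16

# random stand-in weights with the same shapes as Dense(16) and Dense(1)
W1 = rng.normal(scale=0.1, size=(embedding_dim, hidden_units))
b1 = np.zeros(hidden_units)
W2 = rng.normal(scale=0.1, size=(hidden_units, 1))
b2 = np.zeros(1)

def forward(x):
    """ReLU hidden layer followed by a sigmoid output unit."""
    h = np.maximum(x @ W1 + b1, 0.0)
    logits = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

dense_params = W1.size + b1.size    # 20 * 16 + 16
dense_1_params = W2.size + b2.size  # 16 * 1 + 1
print(dense_params, dense_1_params)  # 336 17

# 400,020 embedding parameters equal 20,001 * 20 (inferred factorization)
embedding_params = 20001 * embedding_dim
total_params = embedding_params + dense_params + dense_1_params
print(total_params)  # 400373

# fake batch of 10 sentence embeddings, in place of the hub layer's output
x = rng.normal(size=(10, embedding_dim))
probs = forward(x)
print(probs.shape)  # (10, 1)
```

Because the hub layer was created with `trainable=True`, the embedding weights are counted as trainable, which is why they dominate the total.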
