<h1> Text Classification using TensorFlow/Keras on Cloud ML Engine </h1>

We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub. 

We will use [Hacker News](https://news.ycombinator.com/) as our data source. It is an aggregator that displays tech-related headlines from various sources.

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Using TF Hub for transfer learning
<li> Creating a sentence level text classification model using Keras
<li> Creating a word level text classification model using Keras
</ol>

In [1]:
# change these to try this notebook out
BUCKET = 'vijays-sandbox-ml'
PROJECT = 'vijays-sandbox'
REGION = 'us-central1'
SEED = 0

## GPU Strongly Recommended

This entire notebook will run in under 10 minutes using a V100 GPU, but will take about 3 hours on CPU.

You can add a GPU to your AI Platform Notebook instance following [these instructions](https://cloud.google.com/ml-engine/docs/notebooks/manage-hardware-accelerators).

After adding the GPU, the subsequent cell should print "GPU Enabled: True". To manage costs, you can remove the GPU after completing the lab.

In [2]:
import tensorflow as tf
print(tf.__version__) # tested with tf 2.0.0-beta1
print('GPU Enabled: {}'.format(tf.test.is_gpu_available())) # GPU Recommended

2.0.0-beta1
GPU Enabled: True


# Create Dataset from BigQuery 

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset:

In [3]:
%%bigquery --project $PROJECT
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
  AND LENGTH(url) > 0
LIMIT 10

Unnamed: 0,url,title,score
0,http://alsop-louie.com/portfolio/portfolio-rev...,Portfolio Review: Justin.tv,11
1,http://torrentfreak.com/dutch-isps-ordered-to-...,Dutch ISPs ordered to block The Pirate Bay,11
2,http://devstand.com/2012/02/03/impressive-3d-h...,Stunning 3D Examples of HTML5 Artwork,11
3,http://www.erlang-factory.com/conference/SFBay...,2010 SF Bay Area Erlang Factory Programme,11
4,http://www.businessinsider.com/dollar-shave-cl...,"Razors, what a DEAL",11
5,http://zorter.com/rankers/yc,Show HN: Review Zorter and find the best YC co...,11
6,http://blog.tippingpointlabs.com/2009/09/12sec...,What Any Startup or VC Can Learn from 12Second...,11
7,http://www.youtube.com/watch?v=m3giY2eI65w,Impeach Obama for NSA Spying Program?,11
8,http://thebottomline.cpaaustralia.com.au/,60 Minute Interview with Neil Armstrong,11
9,http://codebrief.com/2012/01/the-top-10-javasc...,Top Javascript MVC Frameworks Reviewed,11


Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with <i>nytimes</i>.

In [4]:
%%bigquery --project $PROJECT
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY
  source
ORDER BY num_articles DESC
LIMIT 10

Unnamed: 0,source,num_articles
0,blogspot,41386
1,github,36525
2,techcrunch,30891
3,youtube,30848
4,nytimes,28787
5,medium,18422
6,google,18235
7,wordpress,17667
8,arstechnica,13749
9,wired,12841
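
To make the URL parsing easier to follow, here is a minimal Python sketch of the same idea: capture the host with the query's regular expression, keep only `.com` hosts, and take the second-to-last dot-separated label. The function name `extract_source` and the sample URLs are illustrative assumptions, not part of the original notebook. The sketch uses a literal suffix check, which is slightly stricter than the query's unescaped `'.com$'` pattern.

```python
import re

# Same host-capturing pattern as in the BigQuery query above.
HOST_PATTERN = re.compile(r".*://(.[^/]+)/")


def extract_source(url):
    """Return the second-to-last host label for .com URLs, else None.

    Hypothetical helper for illustration only.
    """
    match = HOST_PATTERN.search(url)
    if match is None:
        return None  # no scheme, or no '/' after the host
    host = match.group(1)
    if not host.endswith(".com"):
        return None  # literal suffix check (assumption: stricter than '.com$')
    labels = host.split(".")
    return labels[-2]  # equivalent to ARRAY_REVERSE(...)[OFFSET(1)]


for url in ["http://mobile.nytimes.com/2014/story.html",
            "https://github.com/user/repo",
            "http://www.bbc.co.uk/news"]:
    print(url, "->", extract_source(url))
```

Like the query, this sketch returns nothing for URLs that have no `/` after the host.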


Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [5]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

query="""
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
  (SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
  )
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

df = bq.query(query + " LIMIT 5").to_dataframe()
df.head()

Unnamed: 0,source,title
0,github,php bdd is now nice
1,github,mpv video player 0.2 release
2,github,show hn re-thinking the business card with dr...
3,github,update css js from chrome developer tool
4,github,simple way to start development with flask usi...


For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  

A simple, repeatable way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).
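
The idea behind hash-based splitting can be sketched in plain Python. This is only an illustration: it uses `hashlib.md5` as a stand-in for BigQuery's `FARM_FINGERPRINT` (an assumption, since the two hash functions produce different values), and the helper names and synthetic titles are hypothetical. The point is that the split depends only on the title, so re-running it always assigns the same row to the same set.

```python
import hashlib


def bucket(title, num_buckets=4):
    """Map a title to a stable bucket in [0, num_buckets).

    md5 is used here as a stand-in for FARM_FINGERPRINT (assumption).
    """
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def assign_split(title):
    """Bucket 0 goes to eval, buckets 1-3 go to train (roughly 25% / 75%)."""
    return "eval" if bucket(title) == 0 else "train"


# Synthetic titles, for illustration only.
titles = ["headline number {}".format(i) for i in range(1000)]
splits = [assign_split(t) for t in titles]
train_fraction = splits.count("train") / len(splits)
print("train fraction: {:.3f}".format(train_fraction))
```

Because identical titles always hash to the same bucket, duplicate headlines never end up on both sides of the split.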

In [6]:
traindf = bq.query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) > 0").to_dataframe()
evaldf  = bq.query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) = 0").to_dataframe()

Below we can see that roughly 75% of the data is used for training, and 25% for evaluation. 

We can also see that within each dataset, the classes are roughly balanced.

In [7]:
traindf['source'].value_counts()

github        27445
techcrunch    23131
nytimes       21586
Name: source, dtype: int64

In [8]:
evaldf['source'].value_counts()

github        9080
techcrunch    7760
nytimes       7201
Name: source, dtype: int64
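
As a quick check of the two claims above, the split ratio and class shares can be computed directly from the counts printed by the two cells. The snippet below copies those counts by hand; the helper name `class_shares` is illustrative.

```python
# Counts copied from the value_counts() outputs above.
train_counts = {"github": 27445, "techcrunch": 23131, "nytimes": 21586}
eval_counts = {"github": 9080, "techcrunch": 7760, "nytimes": 7201}

train_total = sum(train_counts.values())
eval_total = sum(eval_counts.values())
train_share = train_total / (train_total + eval_total)


def class_shares(counts):
    """Return each class's fraction of the rows in one dataset."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


print("train share of all rows: {:.3f}".format(train_share))
print("train class shares:", class_shares(train_counts))
print("eval class shares:", class_shares(eval_counts))
```

The totals match the line counts reported by `wc -l` later in the notebook, and every class share falls between roughly 30% and 38%.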

Finally we will save our data, which is currently in-memory, to disk.

In [9]:
import os, shutil
DATADIR='data/txtcls'
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv( os.path.join(DATADIR,'train.tsv'), header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv( os.path.join(DATADIR,'eval.tsv'), header=False, index=False, encoding='utf-8', sep='\t')

In [10]:
!head -3 data/txtcls/train.tsv

github	this guy just found out how to bypass adblocker
github	show hn  dodo   command line task management for developers
github	show hn  webservicemock   mock out external calls for local development


In [11]:
!wc -l data/txtcls/*.tsv

  24041 data/txtcls/eval.tsv
  72162 data/txtcls/train.tsv
  96203 total


# Sentence Level Model with DNN

Now that we have our dataset, we need to represent our text data numerically. [TensorFlow Hub](https://www.tensorflow.org/hub) makes this super easy. It contains a library of pre-trained text embeddings that we can download and use with a few lines of code.

In particular we will use [this](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1) embedding which encodes sentences into 128 dimensional vectors.

Once we have the embedded representation we can simply feed it through a DNN for classification.

In [12]:
import tensorflow as tf
import tensorflow_hub as hub

# Use the public tf.keras API rather than the private tensorflow.python package
from tensorflow.keras import models
from tensorflow.keras.layers import Dense

import pandas as pd
import numpy as np
from google.cloud import storage

CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}  # label-to-int mapping
MAX_SEQUENCE_LENGTH = 50  # Sentences will be truncated/padded to this length


def load_hacker_news_data(train_data_path, eval_data_path):
    """Parses raw tsv containing hacker news headlines and returns (sentence, integer label) pairs.

    # Arguments:
        train_data_path: string, path to tsv containing training data.
        eval_data_path: string, path to tsv containing eval data.
    # Returns:
        ((train_sentences, train_labels), (eval_sentences, eval_labels)): sentences
          are lists of strings, labels are numpy integer arrays
    """
    # Parse the tab-separated files using pandas
    column_names = ('label', 'text')
    df_train = pd.read_csv(train_data_path, names=column_names, sep='\t')
    df_eval = pd.read_csv(eval_data_path, names=column_names, sep='\t')

    return ((list(df_train['text']), np.array(df_train['label'].map(CLASSES))),
            (list(df_eval['text']), np.array(df_eval['label'].map(CLASSES))))


def input_fn(texts, labels, batch_size, mode):
    """Creates a tf.data.Dataset of (sentence, one-hot label) batches.

    # Arguments:
        texts: [strings], list of sentences
        labels: numpy int vector, integer labels for sentences
        batch_size: int, number of records to use for each batch
        mode: tf.estimator.ModeKeys.TRAIN or tf.estimator.ModeKeys.EVAL
    # Returns:
        tf.data.Dataset, produces feature and label
          tensors one batch at a time
    """
    # Precision and recall metrics require one-hot labels
    labels = tf.one_hot(labels, len(CLASSES))
    dataset = tf.data.Dataset.from_tensor_slices((texts, labels))
    
    if mode == tf.estimator.ModeKeys.EVAL:
        return dataset.batch(batch_size)
    else: 
        return dataset.shuffle(50000).batch(batch_size)

def keras_model(learning_rate):
    """Builds a DNN classifier on top of a TF Hub sentence embedding using Keras.

    # Arguments
        learning_rate: float, learning rate for the Adam optimizer.
    # Returns
        A compiled Keras model
    """
    # Create model instance.
    model = models.Sequential()

    # Add embedding layer
    hub_layer = hub.KerasLayer(
        "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1", 
        output_shape=[128], 
        input_shape=[], 
        dtype=tf.string
    )
    model.add(hub_layer)
    model.add(Dense(500,activation='relu'))
    model.add(Dense(100,activation='relu'))
    model.add(Dense(len(CLASSES), activation='softmax'))

    # Compile model with learning parameters.
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(
        optimizer=optimizer, 
        loss='categorical_crossentropy', 
        metrics=[
            'accuracy',
            tf.keras.metrics.Precision(),
            tf.keras.metrics.Recall()
        ]
    )

    return model

In [13]:
hparams = {
    'train_data_path':'./data/txtcls/train.tsv',
    'eval_data_path':'./data/txtcls/eval.tsv',
    'batch_size':128,
    'learning_rate':.001
}

# Load Data
((train_texts, train_labels), (test_texts, test_labels)) = load_hacker_news_data(
    hparams['train_data_path'], hparams['eval_data_path'])

model = keras_model(learning_rate=hparams['learning_rate'])

train_dataset = input_fn(
    train_texts,
    train_labels,
    hparams['batch_size'],
    mode=tf.estimator.ModeKeys.TRAIN
)
eval_dataset = input_fn(
    test_texts,
    test_labels,
    hparams['batch_size'],
    mode=tf.estimator.ModeKeys.EVAL
)

In [14]:
%%time
tf.random.set_seed(SEED)
model.fit(
    train_dataset,
    epochs=5,
    validation_data=eval_dataset,
    validation_steps=None
)

W0815 19:01:44.741532 139943026558720 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 42.4 s, sys: 6.2 s, total: 48.6 s
Wall time: 32.7 s


<tensorflow.python.keras.callbacks.History at 0x7f4618c65f60>

### Results

We get 80% validation accuracy. Not bad.

# Word Level Model with CNN

While the above method shines in simplicity, it uses a sentence level embedding which ignores the ordering of words. Might we get better performance if we embedded each word individually then fed them into a sequential model? We test that hypothesis now.
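
To see why a sentence embedding built by combining word vectors cannot capture word order, consider the toy sketch below. It uses small random vectors in place of the real TF Hub embedding (an assumption made so the example runs without downloading anything), and the names `toy_embedding` and `sentence_embedding` are hypothetical. Averaging is commutative, so any reordering of the same words yields the same vector.

```python
import numpy as np

# Toy 4-dimensional word vectors standing in for the real 128-dimensional
# TF Hub embedding (assumption, for illustration only).
rng = np.random.RandomState(0)
vocab = ["google", "buys", "startup"]
toy_embedding = {word: rng.normal(size=4) for word in vocab}


def sentence_embedding(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    vectors = [toy_embedding[word] for word in sentence.split()]
    return np.mean(vectors, axis=0)


a = sentence_embedding("google buys startup")
b = sentence_embedding("startup buys google")
same = np.allclose(a, b)
print("identical sentence vectors:", same)
```

The two headlines mean different things, yet they map to the same point, which is the limitation the word level model tries to address.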

The `hub.KerasLayer()` method does not support word level embeddings natively; instead, it averages the component word embeddings into a single sentence embedding. To get one embedding per word, we must do the embedding upfront in `input_fn()`. In particular we:
1. Split each sentence into a list of its component words
2. Pad each list to a constant length
3. Embed each word into a 128-dimensional vector representation
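
The three steps above can be sketched on a toy scale as follows. This is not the notebook's implementation: it replaces the TF Hub embedding with a small random lookup table and uses a shorter sequence length so the result is easy to inspect. `MAX_LEN`, `EMBED_DIM`, `embed_word`, and `embed_sentence` are hypothetical names.

```python
import numpy as np

MAX_LEN = 6      # stands in for MAX_SEQUENCE_LENGTH (50 in the notebook)
EMBED_DIM = 4    # stands in for the 128-dimensional embedding
PAD = "<PAD>"

rng = np.random.RandomState(0)
toy_vectors = {}  # word -> vector, filled lazily (assumption: toy embedding)


def embed_word(word):
    """Look up (or create) a fixed random vector for a word."""
    if word not in toy_vectors:
        toy_vectors[word] = rng.normal(size=EMBED_DIM)
    return toy_vectors[word]


def embed_sentence(sentence):
    """Split, pad or truncate to MAX_LEN, then embed each word."""
    words = sentence.split()                       # step 1: split
    words = (words + [PAD] * MAX_LEN)[:MAX_LEN]    # step 2: pad / truncate
    return np.stack([embed_word(w) for w in words])  # step 3: embed


matrix = embed_sentence("show hn a tiny example")
print(matrix.shape)  # one row per word position, one column per dimension
```

Every sentence becomes a matrix of the same shape regardless of its length, which is what lets the sequences be batched and fed to a convolutional model.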

Note the changes to the `input_fn()` below.

Since the input function now returns a sequence of word embeddings, we can process the data using a sequential model. Specifically we'll use a 1D CNN. Note the changes to `keras_model()` below.
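
Before reading the model code, it may help to trace how the tensor shape (ignoring the batch dimension) changes through the network. The sketch below is plain Python arithmetic, not Keras. It assumes the default hyperparameters from the `keras_model()` definition that follows, `'same'` padding for the convolutions, and a pooling stride equal to the pool size. The function name `cnn_shapes` is hypothetical.

```python
def cnn_shapes(seq_len=50, embed_dim=128, filters=64, pool_size=3, num_classes=3):
    """Return (layer name, output shape) pairs for the 1D CNN, batch dim omitted."""
    shapes = [("input", (seq_len, embed_dim))]
    # 'same' padding keeps the sequence length; channels become `filters`.
    shapes.append(("conv1d_1", (seq_len, filters)))
    # Non-overlapping pooling windows with no padding.
    pooled_len = (seq_len - pool_size) // pool_size + 1
    shapes.append(("max_pool", (pooled_len, filters)))
    # Second convolution doubles the number of filters.
    shapes.append(("conv1d_2", (pooled_len, filters * 2)))
    # Global average pooling collapses the sequence dimension.
    shapes.append(("global_avg_pool", (filters * 2,)))
    shapes.append(("dense_softmax", (num_classes,)))
    return shapes


for name, shape in cnn_shapes():
    print("{:16s} {}".format(name, shape))
```

Dropout layers are omitted because they do not change the shape.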

In [15]:
# Use the public tf.keras API rather than the private tensorflow.python package
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import GlobalAveragePooling1D

def input_fn(texts, labels, batch_size, mode):
    """Creates a tf.data.Dataset of (word embedding sequence, one-hot label) batches.

    # Arguments:
        texts: [strings], list of sentences
        labels: numpy int vector, integer labels for sentences
        batch_size: int, number of records to use for each batch
        mode: tf.estimator.ModeKeys.TRAIN or tf.estimator.ModeKeys.EVAL
    # Returns:
        tf.data.Dataset, produces feature and label
          tensors one batch at a time
    """
    #precision and recall metrics require one hot labels
    labels = tf.one_hot(labels,len(CLASSES)) 
    #split sentences into lists of words
    texts = [sentence.split() for sentence in texts] 
    # pad to constant length
    texts = [(sentence + MAX_SEQUENCE_LENGTH * ['<PAD>'])[:MAX_SEQUENCE_LENGTH] for sentence in texts] 
    #embed
    embed = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1")
    texts = [embed(sentence) for sentence in texts]

    dataset = tf.data.Dataset.from_tensor_slices((texts, labels))
    
    if mode == tf.estimator.ModeKeys.EVAL:
        return dataset.batch(batch_size)
    else: 
        return dataset.shuffle(50000).batch(batch_size)

def keras_model(learning_rate, filters=64, dropout_rate=0.2, kernel_size=3, pool_size=3):
    """Builds a 1D CNN model using Keras.

    # Arguments
        learning_rate: float, learning rate for the Adam optimizer.
        filters: int, number of filters in the first Conv1D layer
          (the second Conv1D layer uses twice as many).
        dropout_rate: float, fraction of input to drop at Dropout layers.
        kernel_size: int, length of the convolution window.
        pool_size: int, factor by which to downscale input at MaxPooling layer.
    # Returns
        A compiled Keras model
    """
    # Create model instance.
    model = models.Sequential()

    model.add(Dropout(input_shape=(MAX_SEQUENCE_LENGTH,128),rate=dropout_rate))
    model.add(Conv1D(
        filters=filters,
        kernel_size=kernel_size,
        activation='relu',
        bias_initializer='random_uniform',
        padding='same'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(
        filters=filters * 2,
        kernel_size=kernel_size,
        activation='relu',
        bias_initializer='random_uniform',
        padding='same'))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(rate=dropout_rate))
    model.add(Dense(len(CLASSES), activation='softmax'))

    # Compile model with learning parameters.
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(
        optimizer=optimizer, 
        loss='categorical_crossentropy', 
        metrics=[
            'accuracy',
            tf.keras.metrics.Precision(),
            tf.keras.metrics.Recall()
        ]
    )
    return model

**The subsequent cell takes about 3 hours on CPU, about 6 minutes on a P100 GPU, and about 4 minutes on a V100 GPU.**

It takes this long because the input function now embeds every word of every headline before training starts.


In [16]:
%%time
train_dataset = input_fn(
    train_texts,
    train_labels,
    hparams['batch_size'],
    mode=tf.estimator.ModeKeys.TRAIN
)
eval_dataset = input_fn(
    test_texts,
    test_labels,
    hparams['batch_size'],
    mode=tf.estimator.ModeKeys.EVAL
)

CPU times: user 7min 12s, sys: 1min 27s, total: 8min 40s
Wall time: 6min 20s


In [17]:
%%time
model = keras_model(learning_rate=hparams['learning_rate'])

tf.random.set_seed(SEED)
model.fit(
    train_dataset,
    epochs=5,
    validation_data=eval_dataset,
    validation_steps=None
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 1min 1s, sys: 12.9 s, total: 1min 14s
Wall time: 57.4 s


<tensorflow.python.keras.callbacks.History at 0x7f45d42afe80>

Our accuracy improved to 83%! Looks like paying attention to word order does help.

# Save Trained Model

In [18]:
model.save('headline_classification_model.h5')

# Deploy trained model 

See [deploy.ipynb](deploy.ipynb)

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License