# Deep Products: Deep Tag Labeler

This is the first project for the book Deep Products, about using NLP and weakly supervised learning to build complete machine learning products. We use the non-code text of Stack Overflow posts (questions and answers) to tag them with a multi-class, multi-label classifier built on LSTMs and ELMo embeddings.

In [1]:
import os
import re

from keras import backend as K
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from tqdm import tqdm_notebook

Using TensorFlow backend.


## Load 14 Million Answered Questions with Tags from Stack Overflow

We load all answered questions from Stack Overflow. This data was converted from XML to JSON and then sampled using Spark on a single `r5.12xlarge` machine cluster with [code/stackoverflow/sample_json.spark.py](stackoverflow/sample_json.spark.py).

In [2]:
posts_df = pd.read_parquet(
    'data/stackoverflow/parquet/Questions.Answered.parquet',
    columns=['_Body', '_Tags'],
    filters=[('_Tags','!=',None),],
    engine='pyarrow'
)
posts_df.head(5)

Unnamed: 0,_Body,_Tags
0,<p>I want to use a track-bar to change a form'...,<c#><floating-point><type-conversion><double><...
1,<p>I have an absolutely positioned <code>div</...,<html><css><css3><internet-explorer-7>
2,<p>Given a <code>DateTime</code> representing ...,<c#><.net><datetime>
3,<p>Given a specific <code>DateTime</code> valu...,<c#><datetime><time><datediff><relative-time-s...
4,<p>Is there a standard way for a web server to...,<html><browser><timezone><user-agent><timezone...


In [3]:
posts_df = posts_df.head(1000000)
posts_df.head(5)

Unnamed: 0,_Body,_Tags
0,<p>I want to use a track-bar to change a form'...,<c#><floating-point><type-conversion><double><...
1,<p>I have an absolutely positioned <code>div</...,<html><css><css3><internet-explorer-7>
2,<p>Given a <code>DateTime</code> representing ...,<c#><.net><datetime>
3,<p>Given a specific <code>DateTime</code> valu...,<c#><datetime><time><datediff><relative-time-s...
4,<p>Is there a standard way for a web server to...,<html><browser><timezone><user-agent><timezone...


## Drop Unlabeled Posts

Note: these posts have already been filtered to remove untagged questions, so each post has between one and five labels.

In [4]:
tag_posts = posts_df.dropna(axis=0, subset=['_Tags'])
print('Posts w/ tags: {:,}'.format(len(tag_posts.index)))
tag_posts.head(5)

Posts w/ tags: 1,000,000


Unnamed: 0,_Body,_Tags
0,<p>I want to use a track-bar to change a form'...,<c#><floating-point><type-conversion><double><...
1,<p>I have an absolutely positioned <code>div</...,<html><css><css3><internet-explorer-7>
2,<p>Given a <code>DateTime</code> representing ...,<c#><.net><datetime>
3,<p>Given a specific <code>DateTime</code> valu...,<c#><datetime><time><datediff><relative-time-s...
4,<p>Is there a standard way for a web server to...,<html><browser><timezone><user-agent><timezone...


## Extract the Tags from their XML tags

In [6]:
# Raw string avoids invalid escape sequences; angle brackets need no escaping in a regex
tag_posts['_Tag_List'] = tag_posts['_Tags'].apply(lambda x: re.findall(r'<(.+?)>', x))

flat_tags = tag_posts.apply(lambda x: pd.Series(x['_Tag_List']),axis=1).stack().reset_index(level=1, drop=True)
flat_tags.head(5)
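
As a quick sanity check of the pattern, here is a minimal standalone sketch that applies the same regular expression to one of the `_Tags` strings shown earlier. The `raw_tags` variable is illustrative and not part of the notebook.

```python
import re

# One `_Tags` value as it appears in the data: each tag is wrapped in angle brackets
raw_tags = '<c#><.net><datetime>'

# The non-greedy group captures the text between each pair of angle brackets
tag_list = re.findall(r'<(.+?)>', raw_tags)
print(tag_list)
```

The non-greedy `.+?` matters here: a greedy `.+` would swallow everything between the first `<` and the last `>` as a single tag.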

## Count the Most Common Tags

In [None]:
# tag_counts = flat_tags.groupby(flat_tags).count().sort_values(ascending=False)
# print(tag_counts[0:10])

## Check the Distribution of Tag Counts

This is split between two charts for clarity: tag counts of 100 or fewer, and tag counts above 100. The data is highly skewed, which is an issue we must address.

In [None]:
# %matplotlib inline
# import matplotlib.pyplot as plt

# tag_counts[tag_counts <= 100].hist(bins=50)

In [None]:
# tag_counts[tag_counts > 100].hist(bins=50)

## Try Different Thresholds for Filtering Tags by Frequency

The higher the threshold, the fewer classes remain, the less sparse the label matrix becomes, and the easier the learning task is.

In [7]:
from collections import defaultdict

tag_counts = defaultdict(int)

for row in tag_posts['_Tag_List']:
    for tag in row:
        tag_counts[tag] += 1

for i in [0, 10, 20, 50, 100, 1000, 5000]:
    filtered_tags = list(filter(lambda x: x > i, tag_counts.values()))
    print('There are {:,} tags with more than {:,} count'.format(len(filtered_tags), i))

MIN_TAGS = 5000

record_count = sum(1 for count in tag_counts.values() if count > MIN_TAGS)
record_count

There are 23,795 tags with more than 0 count
There are 9,706 tags with more than 10 count
There are 6,660 tags with more than 20 count
There are 3,901 tags with more than 50 count
There are 2,508 tags with more than 100 count
There are 416 tags with more than 1,000 count
There are 70 tags with more than 5,000 count


70
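
To make the thresholding step concrete, the following is a small sketch on made-up data. The names `toy_tag_lists` and `tags_above` are hypothetical, and `collections.Counter` is used here in place of the notebook's `defaultdict`; the counting logic is otherwise the same.

```python
from collections import Counter

# Hypothetical toy data: each inner list holds one post's tags
toy_tag_lists = [['python', 'pandas'], ['python'], ['c#', '.net'], ['python', 'c#']]

# Count how often each tag occurs across all posts
toy_counts = Counter(tag for tags in toy_tag_lists for tag in tags)

def tags_above(counts, threshold):
    """Return the sorted tags that occur more than `threshold` times."""
    return sorted(tag for tag, n in counts.items() if n > threshold)

for threshold in [0, 1, 2]:
    kept = tags_above(toy_counts, threshold)
    print('There are {:,} tags with more than {:,} count: {}'.format(len(kept), threshold, kept))
```

Raising the threshold shrinks the label set quickly, which is the same effect seen in the real counts above.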

## Map from Tags to IDs

In [8]:
all_tags = set()
for row in tag_posts['_Tag_List']:
    for tag in row:
        if tag_counts[tag] > MIN_TAGS:
            all_tags.add(tag)

print('Total unique tags with {:,} occurrences: {:,}'.format(MIN_TAGS, len(all_tags)))
sorted_all_tags = sorted(all_tags)

tag_to_id = {val:i for i, val in enumerate(sorted_all_tags)}
id_to_tag = {i:val for i, val in enumerate(sorted_all_tags)}

Total unique tags with 5,000 occurrences: 70


## One Hot Encode Tag Lists

In [9]:
labels = []

# Loop through every post...
for tag_set in tag_posts['_Tag_List'].tolist():
    # ...and build a record_count-wide row with a 1 for each frequent tag present
    post_tags = set(tag_set)
    labels.append([1 if tag in post_tags else 0 for tag in sorted_all_tags])
    
tag_labels = [id_to_tag[key_id] for key_id in sorted(id_to_tag.keys()) if tag_counts[id_to_tag[key_id]] > MIN_TAGS]

len(labels), len(labels[0]), len(tag_labels)

(1000000, 70, 70)
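
The encoding can be illustrated on a tiny example. This is a sketch with a hypothetical three-tag vocabulary (`toy_vocab`) and three made-up posts (`toy_posts`); the `encode` helper is not part of the notebook.

```python
# Hypothetical vocabulary of frequent tags, sorted as in the notebook
toy_vocab = sorted(['python', 'c#', 'html'])

# Hypothetical posts; 'pandas' and 'java' are not in the vocabulary
toy_posts = [['python', 'pandas'], ['html', 'c#'], ['java']]

def encode(tag_lists, vocab):
    """Build one 0/1 row per post with one column per vocabulary tag."""
    rows = []
    for tags in tag_lists:
        present = set(tags)  # set membership is O(1) per lookup
        rows.append([1 if tag in present else 0 for tag in vocab])
    return rows

toy_labels = encode(toy_posts, toy_vocab)
for tags, row in zip(toy_posts, toy_labels):
    print(tags, row)
```

Note that the last post ends up with an all-zero row because none of its tags survived the frequency filter. Rows like this are removed later in the notebook.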

## Extract/Tokenize Non-Code Text from Posts

We leave posts' source code out for now because it will need a different embedding and thus multiple inputs.

In [21]:
from multiprocessing import Pool

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer

def parallelize_series(s, func, n_cores=4):
    s_split = np.array_split(s, n_cores)
    pool = Pool(n_cores)
    s = pd.concat(pool.map(func, s_split))
    pool.close()
    pool.join()
    return s

N_CORES = 12
MAX_LEN = 100
PAD_TOKEN = '__PAD__'
BATCH_SIZE = 64

def extract_text(s):
    """Extract non-code text from posts (questions/answers)"""
    text = []
    for body in s:
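        # Name the parser explicitly so results do not depend on which parsers are installed
        doc = BeautifulSoup(body, 'html.parser')
        # Remove every <code> element so only the prose remains
        for code in doc.find_all('code'):
            code.extract()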
        tokens = doc.text.split()
        padded_tokens = [tokens[i].lower() if len(tokens) > i else PAD_TOKEN for i in range(0,MAX_LEN)]
        text.append(padded_tokens)
    return pd.Series(text)

post_text = parallelize_series(tag_posts._Body, extract_text, n_cores=N_CORES).reset_index(drop=True)
#post_text = tag_posts._Body.apply(extract_text).reset_index(drop=True)
post_text.head(5)

0    [i, want, to, use, a, track-bar, to, change, a...
1    [i, have, an, absolutely, positioned, containi...
2    [given, a, representing, a, person's, birthday...
3    [given, a, specific, value,, how, do, i, displ...
4    [is, there, a, standard, way, for, a, web, ser...
dtype: object
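
To show what dropping `<code>` elements does to a single post, here is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. It is an illustration of the idea, not the notebook's implementation; the `NonCodeText` class and `strip_code` helper are hypothetical names.

```python
from html.parser import HTMLParser

class NonCodeText(HTMLParser):
    """Collect text that is not nested inside a <code> element."""

    def __init__(self):
        super().__init__()
        self.code_depth = 0  # > 0 while inside one or more <code> tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag == 'code' and self.code_depth > 0:
            self.code_depth -= 1

    def handle_data(self, data):
        if self.code_depth == 0:
            self.parts.append(data)

def strip_code(html):
    """Return the post's prose with code removed and whitespace normalized."""
    parser = NonCodeText()
    parser.feed(html)
    parser.close()
    return ' '.join(''.join(parser.parts).split())

sample = '<p>Given a <code>DateTime</code> representing a birthday</p>'
clean = strip_code(sample)
print(clean)
```

The removed identifier leaves a gap in the sentence ("Given a representing a birthday"), which is the same effect visible in the tokenized output above.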

In [22]:
len(post_text.index), len(post_text.iloc[0]), len(labels), len(labels[0])

(1000000, 100, 1000000, 70)

In [23]:
# Validate the posts match the labels
assert(len(post_text.index) == len(labels))
print('We are left with {:,} example posts'.format(len(post_text.index)))

We are left with 1,000,000 example posts


## Make Record Count a Multiple of the Batch Size and Post Sequence Length

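The ELMo embedding layer defined later in this notebook builds its `sequence_len` input for exactly `BATCH_SIZE` rows, so every batch must be full. To guarantee this, we trim the number of records to a multiple of the batch size times the number of tokens in the padded posts.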
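
The trimming arithmetic can be sketched on its own. The record count of 75,000 below is an illustrative value, not a figure from the data, and `trimmed_count` is a hypothetical helper.

```python
import math

BATCH_SIZE = 64
MAX_LEN = 100

def trimmed_count(n_records, batch_size=BATCH_SIZE, max_len=MAX_LEN):
    """Largest multiple of batch_size * max_len that does not exceed n_records."""
    block = batch_size * max_len
    return math.floor(n_records / block) * block

example_count = trimmed_count(75_000)  # illustrative input
print('{:,} records kept, {:,} full batches'.format(example_count, example_count // BATCH_SIZE))
```

With these constants each block is 6,400 records, so up to 6,399 trailing records are discarded.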

In [24]:
import math

# Filter label rows that don't have any positive labels
label_mx = np.array(labels)
max_per_row = label_mx.max(axis=1)
non_zero_index = np.nonzero(max_per_row)[0]

label_mx = label_mx[non_zero_index]

# Filter the posts to match
post_text = post_text[post_text.index.isin(non_zero_index)]
# Assumed completion of a truncated line: later cells need a 2-D array of strings
post_text = np.array(post_text.tolist())

assert(post_text.shape[0] == label_mx.shape[0])
print('Unfiltered Counts: {:,} {:,}'.format(post_text.shape[0], label_mx.shape[0]))

# training_count must be a multiple of the BATCH_SIZE times the MAX_LEN for the Elmo embedding layer
highest_factor = math.floor(post_text.shape[0] / (BATCH_SIZE * MAX_LEN))
training_count = highest_factor * BATCH_SIZE * MAX_LEN
print('Highest Factor: {:,} Training Count: {:,}'.format(highest_factor, training_count))

label_mx = label_mx[0:training_count]
post_text = post_text[0:training_count]

assert(post_text.shape[0] == label_mx.shape[0])
print('Final Counts: {:,} {:,}'.format(post_text.shape[0], label_mx.shape[0]))

MemoryError: 

In [15]:
post_text

array([['this', 'is', 'what', ..., '__PAD__', '__PAD__', '__PAD__'],
       ['hi', 'this', 'script', ..., '__PAD__', '__PAD__', '__PAD__'],
       ["i'm", 'creating', 'an', ..., 'figure', 'out', 'how'],
       ...,
       ['i', 'want', 'to', ..., '__PAD__', '__PAD__', '__PAD__'],
       ['assertjson', 'fails', 'even', ..., '__PAD__', '__PAD__',
        '__PAD__'],
       ['i', 'am', 'doing', ..., '__PAD__', '__PAD__', '__PAD__']],
      dtype='<U1356')

## Create an ELMo Embedding Layer using TensorFlow Hub

Note that this layer takes a padded two-dimensional array of strings.
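
As a reminder of what "padded" means here, the sketch below reproduces the truncate-and-pad step from `extract_text` in isolation. The `pad_tokens` helper is a hypothetical name; `MAX_LEN` and `PAD_TOKEN` are the notebook's constants.

```python
MAX_LEN = 100
PAD_TOKEN = '__PAD__'

def pad_tokens(tokens, max_len=MAX_LEN, pad_token=PAD_TOKEN):
    """Lowercase the tokens, truncate to max_len, and right-pad with pad_token."""
    clipped = [t.lower() for t in tokens[:max_len]]
    return clipped + [pad_token] * (max_len - len(clipped))

batch = [
    pad_tokens('I want to use a track-bar'.split()),  # short post: padded
    pad_tokens(['word'] * 250),                        # long post: truncated
]

# Every row has the same width, so the batch forms a rectangular 2-D array of strings
print([len(row) for row in batch])
```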

In [16]:
# From https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/

sess = tf.Session()
K.set_session(sess)

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

def ElmoEmbedding(x):
    return elmo_model(inputs={
                            "tokens": tf.squeeze(tf.cast(x, tf.string)),
                            "sequence_len": tf.constant(BATCH_SIZE*[MAX_LEN])
                      },
                      signature="tokens",
                      as_dict=True)["elmo"]

W0722 15:34:57.517402 140412342794048 module_wrapper.py:136] From /home/rjurney/anaconda/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/util/module_wrapper.py:163: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.



## Experimental Setup

We use `train_test_split` rather than k-fold cross-validation because cross-validation is too expensive at this scale.

In [17]:
from sklearn.model_selection import train_test_split

TEST_SPLIT = 0.15

X_train, X_test, y_train, y_test = train_test_split(
    post_text,
    label_mx,
    test_size=TEST_SPLIT,
    random_state=34
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((59840, 100), (10560, 100), (59840, 33), (10560, 33))

## Create an LSTM Model to Classify Posts into Tags

We use the padded, tokenized posts as input: an ELMo embedding feeds a Long Short-Term Memory (LSTM) layer, followed by a Dense layer with one output neuron per tag in our tag list.

We also define focal loss, a loss function used in applications like object detection, because it down-weights easy, well-classified examples and so copes better with heavily skewed classes.

In [18]:
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

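def focal_loss(y_true, y_pred):
    gamma = 2.0
    alpha = 0.25
    # Clip predictions away from 0 and 1 so neither log term can produce NaN or inf
    y_pred = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
    # pt_1 holds predictions for positive labels (1 elsewhere, so log(pt_1) = 0 there)
    pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
    # pt_0 holds predictions for negative labels (0 elsewhere, so log(1 - pt_0) = 0 there)
    pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
    return -K.sum(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1)) - K.sum((1 - alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0))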

def abs_KL_div(y_true, y_pred):
    y_true = K.clip(y_true, K.epsilon(), None)
    y_pred = K.clip(y_pred, K.epsilon(), None)
    return K.sum(K.abs( (y_true - y_pred) * (K.log(y_true / y_pred))), axis=-1)
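
The behavior of `focal_loss` can be checked on a toy example outside of Keras. The following is a NumPy sketch of the same formula, assuming the same `gamma` and `alpha` values; `focal_loss_np` and the toy arrays are hypothetical, and the clipping constant `eps` is an assumption added to keep the logarithms finite.

```python
import numpy as np

def focal_loss_np(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Summed binary focal loss over all labels (NumPy version for inspection)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Loss for positive labels: small when the prediction is close to 1
    pos = -alpha * (1.0 - y_pred) ** gamma * np.log(y_pred)
    # Loss for negative labels: small when the prediction is close to 0
    neg = -(1.0 - alpha) * y_pred ** gamma * np.log(1.0 - y_pred)
    return float(np.sum(np.where(y_true == 1, pos, neg)))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.9, 0.1])  # correct and confident
unsure = np.array([0.6, 0.4, 0.6, 0.4])     # correct but hesitant

easy_loss = focal_loss_np(y_true, confident)
hard_loss = focal_loss_np(y_true, unsure)
print(easy_loss, hard_loss)
```

The confident predictions contribute far less loss than the hesitant ones, because the `(1 - p) ** gamma` factor shrinks the contribution of examples the model already classifies well.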

In [19]:
# from keras.layers import Input, concatenate, Activation, Dense, LSTM, BatchNormalization, Embedding, Dropout, Lambda, Bidirectional
# from keras.metrics import categorical_accuracy, top_k_categorical_accuracy
# from keras.models import Model
# from keras.optimizers import Adam
# from keras_metrics import precision, f1_score, false_negative, true_positive, false_positive, true_negative

# # Text model
# text_input = Input(shape=(MAX_LEN,), dtype=tf.string)

# elmo_embedding = Lambda(ElmoEmbedding, output_shape=(MAX_LEN, 1024))(text_input)

# text_lstm = LSTM(
#     input_shape=(MAX_LEN, 1024,),
#     units=512,
#     recurrent_dropout=0.2,
#     dropout=0.2)(elmo_embedding)

# text_dense = Dense(200, activation='relu')(text_lstm)

# text_output = Dense(record_count, activation='sigmoid')(text_dense)

# text_model = Model(
#     inputs=text_input, 
#     outputs=text_output
# )



# from sklearn.metrics import hamming_loss

# from keras.optimizers import Adam
# adam = Adam(lr=0.0005)

# text_model.compile(
#     loss='binary_crossentropy',
#     optimizer=adam,
#     metrics=[
#         precision_m,
#         recall_m,
#         f1_m,
#         'mae',
#         abs_KL_div,
#         'accuracy'
#     ]
# )
# 
# text_model.summary()

## Compute Sample and Class Weights

Because we have skewed classes and multiple classes per example, we employ sample or class weights, which weight each row or label by the inverse of its relative frequency so that rare tags are not drowned out by common ones.
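
As a sketch of the inverse-frequency idea, the snippet below computes class weights as the number of examples divided by each tag's count. The counts in `toy_tag_counts` and the value of `n_examples` are made up for illustration.

```python
# Hypothetical tag counts and dataset size
toy_tag_counts = {'python': 500, 'c#': 250, 'haskell': 10}
n_examples = 1000

# Sort the tags so each class index is stable, as the notebook does
toy_vocab = sorted(toy_tag_counts)

# Rare tags get large weights, common tags get small ones
toy_class_weights = {
    i: n_examples / toy_tag_counts[tag]
    for i, tag in enumerate(toy_vocab)
}
print(toy_class_weights)
```

In this toy case the rare `haskell` tag is weighted 50 times more heavily than `python`, which is the intended counterbalance to class skew.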

In [20]:
from sklearn.utils.class_weight import compute_sample_weight

train_sample_weights = compute_sample_weight('balanced', y_train)
test_sample_weights = compute_sample_weight('balanced', y_test)

In [21]:
class_weights = {}
for i, tag in enumerate(sorted_all_tags):
    class_weights[i] = label_mx.shape[0] / tag_counts[tag]

class_weights

{0: 44.24890006285355,
 1: 61.80860403863038,
 2: 11.001719018596656,
 3: 51.76470588235294,
 4: 40.36697247706422,
 5: 35.84521384928717,
 6: 40.43653072946582,
 7: 9.547057228098724,
 8: 19.670298966191673,
 9: 20.72416838386812,
 10: 64.0582347588717,
 11: 64.46886446886447,
 12: 14.614905542869005,
 13: 21.43727161997564,
 14: 50.79365079365079,
 15: 8.007279344858963,
 16: 6.984126984126984,
 17: 12.867848656552733,
 18: 46.591661151555265,
 19: 69.42800788954635,
 20: 22.229239027470793,
 21: 48.65238424326192,
 22: 40.41331802525833,
 23: 9.639874024373546,
 24: 10.880989180834622,
 25: 45.86319218241042,
 26: 57.65765765765766,
 27: 55.34591194968554,
 28: 38.745184369840395,
 29: 23.56091030789826,
 30: 46.31578947368421,
 31: 58.91213389121339,
 32: 66.10328638497653}

## Establish a Log for Performance

## Simple Baseline Model using `Conv1D`

In [24]:
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from keras.layers import Input, Embedding, GlobalMaxPool1D, Conv1D, Dense, Activation, Dropout, Lambda
from keras.models import Model, Sequential
from keras_metrics import precision, f1_score, false_negative, true_positive, false_positive, true_negative


def build_model(max_len, label_count, dropout_ratio=0.1, filter_length=50):
    
    text_input = Input(shape=(max_len,), dtype=tf.string)

    elmo_embedding = Lambda(ElmoEmbedding, output_shape=(max_len, 1024))(text_input)

    dropout = Dropout(dropout_ratio)(elmo_embedding)

    conv1d = Conv1D(filter_length, 3, padding='valid', activation='relu', strides=1)(dropout)

    global_1d = GlobalMaxPool1D()(conv1d)

    dense = Dense(label_count, activation='sigmoid')(global_1d)

    text_model = Model(
        inputs=text_input, 
        outputs=dense
    )

    text_model.compile(
        optimizer='adam',
        loss=abs_KL_div,  # alternative: 'binary_crossentropy'
        metrics=[
            'categorical_accuracy',
            precision_m,
            recall_m,
            f1_m,
            'mae',
            abs_KL_div,
            true_positive(),
            false_positive(),
            true_negative(),
            false_negative(),
            'accuracy',
        ]
    )
    text_model.summary()
    
    return text_model

In [26]:
text_model = build_model(MAX_LEN, y_train.shape[1])

callbacks = [
    ReduceLROnPlateau(), 
    EarlyStopping(patience=4), 
    # ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)
]

history = text_model.fit(
    X_train, 
    y_train,
    class_weight=class_weights,
    epochs=20,
    batch_size=BATCH_SIZE,
    validation_data=(X_test, y_test),
    callbacks=callbacks
)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 100)               0         
_________________________________________________________________
lambda_3 (Lambda)            (None, 100, 1024)         0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 100, 1024)         0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 98, 50)            153650    
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 33)                1683      
Total params: 155,333
Trainable params: 155,333
Non-trainable params: 0
_________________________________________________________________
Trai
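
The shapes and parameter counts in the summary above can be verified by hand. This sketch recomputes them from the layer settings in `build_model` for the 33-label run shown; the variable names are illustrative.

```python
max_len = 100      # tokens per post
embed_dim = 1024   # ELMo output size per token
filters = 50       # Conv1D filters (the filter_length argument)
kernel_size = 3
label_count = 33   # labels in the run summarized above

# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1
conv_out_len = max_len - kernel_size + 1

# Each filter has kernel_size * embed_dim weights plus one bias
conv_params = kernel_size * embed_dim * filters + filters

# GlobalMaxPool1D leaves one value per filter, which feeds the Dense layer
dense_params = filters * label_count + label_count

print(conv_out_len, conv_params, dense_params, conv_params + dense_params)
```

The ELMo Lambda layer reports zero parameters because its weights live inside the TensorFlow Hub module rather than in a Keras layer, so they do not appear in this total.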

In [None]:
# from keras.callbacks import EarlyStopping

# EPOCHS = 4

# history = text_model.fit(
#     X_train,
#     y_train,
#     epochs=EPOCHS,
#     batch_size=BATCH_SIZE,
#     callbacks=[
#         EarlyStopping(monitor='loss', patience=1, min_delta=0.0001),
#         EarlyStopping(monitor='val_loss', patience=1, min_delta=0.0001),
#     ],
#     class_weight=class_weights,
#     # sample_weight=train_sample_weights,
#     validation_data=(X_test, y_test)
# )

In [None]:
accr = text_model.evaluate(X_test, y_test, sample_weight=test_sample_weights)
[i for i in zip(accr, text_model.metrics_names)]

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

print(history.history)
# summarize history for validation loss and training metrics
plt.plot(history.history['val_loss'])
plt.plot(history.history['f1_m'])
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['precision_m'])
plt.title('model metrics')
plt.ylabel('metric')
plt.xlabel('epoch')
plt.legend(['val_loss', 'f1', 'categorical accuracy', 'MAE', 'precision'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
import statistics

from sklearn.metrics import hamming_loss, jaccard_score
import keras.backend as K
import tensorflow as tf

y_pred = text_model.predict(X_test)

sess = tf.Session()
best_cutoff = 0
max_score = 0
with sess.as_default():
    for cutoff in [0.0001, 0.001, 0.01, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8]:
        y_pred_bin = K.greater(y_pred, cutoff).eval()
        print('Cutoff: {:,}'.format(cutoff))
        print('Hamming loss: {:,}'.format(
            hamming_loss(y_test, y_pred_bin)
        ))
        scores = []
        for j_type in ['micro', 'macro', 'weighted']:
            j_score = jaccard_score(y_test, y_pred_bin, average=j_type)
            print('Jaccard {} score: {:,}'.format(
                j_type,
                j_score
            ))
            scores.append(j_score)
        print('')
        mean_score = statistics.mean(scores)
        if mean_score > max_score:
            best_cutoff = cutoff
            max_score = mean_score

print('Best cutoff was: {:,} with mean jaccard score of {:,}'.format(best_cutoff, max_score))
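
The cutoff search can be illustrated without scikit-learn or a TensorFlow session. This is a simplified sketch that scores each cutoff with the micro-averaged Jaccard index only, whereas the cell above averages the micro, macro, and weighted variants. The `micro_jaccard` helper and the toy arrays are hypothetical.

```python
def micro_jaccard(y_true, y_pred_bin):
    """Micro-averaged Jaccard index: TP / (TP + FP + FN) over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred_bin):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

# Hypothetical labels and predicted probabilities for two posts and three tags
toy_true = [[1, 0, 0], [0, 1, 1]]
toy_prob = [[0.9, 0.3, 0.1], [0.2, 0.6, 0.4]]

best_cutoff, best_score = None, -1.0
for cutoff in [0.1, 0.35, 0.5, 0.8]:
    # Binarize the probabilities at this cutoff
    binarized = [[int(p > cutoff) for p in row] for row in toy_prob]
    score = micro_jaccard(toy_true, binarized)
    print('Cutoff: {} Jaccard micro score: {:.3f}'.format(cutoff, score))
    if score > best_score:
        best_cutoff, best_score = cutoff, score

print('Best cutoff was: {} with score {:.3f}'.format(best_cutoff, best_score))
```

A cutoff that is too low admits false positives and one that is too high drops true positives, so the score peaks somewhere in between.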

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report, multilabel_confusion_matrix

y_pred = text_model.predict(X_test, batch_size=32, verbose=1)
y_pred_bool = np.where(y_pred > best_cutoff, 1, 0)

print(classification_report(y_test, y_pred_bool))

print(multilabel_confusion_matrix(y_test, y_pred_bool))

## View the Results

Now let's map from the one-hot-encoded tags back to the text tags and view them alongside the text of the original posts to sanity-check the model and see if it really works.

In [None]:
predicted_tags = []
for pred in y_pred_bool:
    # Collect the tag name for every label predicted positive
    tags = [sorted_all_tags[i] for i, val in enumerate(pred) if val == 1]
    predicted_tags.append(tags)

for text, tags in zip(X_test, predicted_tags):
    print(' '.join(text), tags)