# Text Classification using Pre-trained Word Embeddings

This notebook uses pre-trained word embeddings (numerical vector representations of words) to train a custom neural network classifier. It is adapted from a Keras example by Francois Chollet, the creator of Keras. I highly recommend his book, Deep Learning with Python.

For the raw text, we'll use the popular Newsgroup20 dataset, a set of roughly 20,000 message board messages belonging to 20 different topic categories.

For the pre-trained word embeddings, we'll use [GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

## Setup

In [15]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import os
import pathlib
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import matplotlib.pyplot as plt
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
prefix = 'keras-text-classification'

## Download the Newsgroup20 data

In [16]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
    cache_dir=".",
    cache_subdir="data"
    
)

Downloading data from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz


## Let's take a look at the data

In [17]:
data_dir = pathlib.Path("./data/20_newsgroup")
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of directories: 20
Directory names: ['talk.politics.mideast', 'rec.motorcycles', 'sci.space', 'soc.religion.christian', 'misc.forsale', 'rec.sport.hockey', 'comp.sys.mac.hardware', 'comp.windows.x', 'comp.os.ms-windows.misc', 'talk.politics.guns', 'sci.crypt', 'talk.religion.misc', 'sci.electronics', 'sci.med', 'alt.atheism', 'rec.autos', 'comp.sys.ibm.pc.hardware', 'talk.politics.misc', 'comp.graphics', 'rec.sport.baseball']
Number of files in comp.graphics: 1000
Some example filenames: ['38823', '38685', '38590', '38513', '38566']


Here's an example of what one file contains:

In [18]:
print((data_dir / "comp.graphics" / "38987").read_text())

Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: mabusj@nason110.its.rpi.edu (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <c285m+p@rpi.edu>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: mabusj@rpi.edu
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7

Jasen Mabus
RPI student

	I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to mabusj@rpi.edu.

Thank you in advance,
Jasen Mabus  



As you can see, there are header lines that leak the file's category, either
explicitly (the first line is literally the category name), or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:

In [19]:
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
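        # Read with latin-1 so that no byte sequence raises a decoding error
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        # Drop the first 10 lines, which hold (most of) the header
        lines = lines[10:]
        content = "\n".join(lines)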
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.ha

There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.
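
One quick way to confirm that balance is to count the labels per class. The sketch below is illustrative only: `class_balance` and the toy inputs are hypothetical stand-ins for the notebook's `labels` and `class_names`.

```python
from collections import Counter


def class_balance(labels, class_names):
    """Return a {class name: sample count} mapping for integer labels."""
    counts = Counter(labels)
    return {class_names[i]: counts[i] for i in range(len(class_names))}


# Toy data standing in for the real labels / class_names
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2]
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]

balance = class_balance(toy_labels, toy_names)
ratio = max(balance.values()) / min(balance.values())
print(balance)
print("largest / smallest class:", ratio)
```

A ratio close to 1 means the classes are roughly the same size, so plain accuracy remains a reasonable metric.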

In [20]:
print(samples[3000])
print(labels[3000])
print(class_names[labels[3000]])

NNTP-Posting-Host: oak.circa.ufl.edu


I'm using int15h to read my joystick, and it is hideously slow.  Something
like 90% of my CPU time is being spent reading the joystick, and this
is in a program that does nothing but printf() and JoyRead().

The problem is that a lot of programs trap int15h ( like SMARTDRV ) and
so it is a slow as hell interface.  Can I read the joystick port in
a reasonably safe fashion via polling?  And that isn't platform or
clockspeed specific?

Thanks,

Brianzex


3
comp.sys.ibm.pc.hardware
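
Note that this sample still starts with an `NNTP-Posting-Host` line: dropping a fixed 10 lines does not always remove the whole header. A possible alternative, which is not what this notebook does, relies on the Usenet message format, where a blank line separates the header from the body. The helper name `strip_headers` and the toy message are made up for illustration.

```python
def strip_headers(message: str) -> str:
    """Return the body of a Usenet-style message (text after the first blank line)."""
    # partition() splits at the first "\n\n"; sep is empty when no blank line exists
    _header, sep, body = message.partition("\n\n")
    return body if sep else message


# Made-up message in the same layout as the files shown earlier
toy_message = (
    "Newsgroups: comp.graphics\n"
    "Subject: Looking for Brain in CAD\n"
    "Lines: 2\n"
    "\n"
    "First body line\n"
    "\n"
    "Second paragraph"
)
body = strip_headers(toy_message)
print(body)
```

Only the first blank line is treated as the separator, so blank lines inside the body are kept.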


## Shuffle and split the data into training & validation sets

In [21]:
# Shuffle the data (in place)
seed = 1234
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
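
The two shuffles stay aligned because both generators start from the same seed and the two lists have the same length, so each list receives the same permutation. Here is a toy check of that property. It uses the standard library's `random.Random` in place of `np.random.RandomState` to stay dependency-free, and all variable names are illustrative.

```python
import random

seed = 1234
toy_samples = ["text-%d" % i for i in range(10)]
toy_labels = list(range(10))  # label i belongs to "text-i"

# Two generators with the same seed apply the same permutation
random.Random(seed).shuffle(toy_samples)
random.Random(seed).shuffle(toy_labels)

# Every sample is still paired with its original label
aligned = all(s == "text-%d" % l for s, l in zip(toy_samples, toy_labels))
print("still aligned:", aligned)

# Hold out the last 20% for validation, as in the cell above
validation_split = 0.2
num_val = int(validation_split * len(toy_samples))
train, val = toy_samples[:-num_val], toy_samples[-num_val:]
print(len(train), "training samples,", len(val), "validation samples")
```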

## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.

In [22]:
%%time

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

CPU times: user 4.25 s, sys: 0 ns, total: 4.25 s
Wall time: 4.16 s


Let's vectorize a test sentence:

In [23]:
output = vectorizer([["the cat sat on the mat"]])
print(output.numpy()[0, :6])
len(output[0,:])

[   2 3789 1740   15    2 5795]


200

As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.
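
To make the indexing scheme concrete, here is a rough pure-Python sketch of the idea. It is not the `TextVectorization` implementation: it only lowercases and splits on whitespace, and it assumes the token budget includes the two reserved ids. `build_vocab` and `vectorize` are hypothetical helper names.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved indices, as described above


def build_vocab(texts, max_tokens):
    """Map the most frequent words to ids starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Two slots of the budget go to the reserved padding and OOV ids (assumption)
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: i + 2 for i, word in enumerate(most_common)}


def vectorize(text, vocab, length):
    """Turn text into exactly `length` ids: truncate, then right-pad with 0."""
    ids = [vocab.get(word, OOV_ID) for word in text.lower().split()][:length]
    return ids + [PAD_ID] * (length - len(ids))


toy_vocab = build_vocab(["the cat sat", "the cat", "the"], max_tokens=4)
padded = vectorize("the cat flew away", toy_vocab, length=6)
truncated = vectorize("the cat flew away", toy_vocab, length=2)
print(toy_vocab)
print(padded)
print(truncated)
```

The most frequent word gets id 2, unknown words map to 1, and short inputs are filled with 0 on the right.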

In [24]:
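# get_vocabulary() is assumed here to return byte strings without the two
# reserved entries, so we decode the tokens and prepend "" (padding, index 0)
# and "[UNK]" (out-of-vocabulary, index 1) by hand. Newer TensorFlow releases
# may already return `str` tokens that include both entries.
vocab = vectorizer.get_vocabulary()
vocab = [a.decode('utf-8') for a in vocab]
vocab.insert(0,'[UNK]')
vocab.insert(0,'')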

## Load pre-trained word embeddings

Let's download pre-trained GloVe embeddings (an 822 MB zip file).

You'll need to run the following commands:

In [27]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2021-01-12 22:45:54--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-01-12 22:45:55--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-01-12 22:45:55--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:

In [28]:
path_to_glove_file = pathlib.Path('./glove.6B.100d.txt')

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
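
Each line of the GloVe text file holds a token followed by its space-separated coefficients. Below is a minimal pure-Python sketch of the same parsing on made-up 3-dimensional lines (the file used here has 100 values per line). `parse_glove_line` is a hypothetical helper, and the numbers are not real GloVe values.

```python
def parse_glove_line(line):
    """Split one 'word v1 v2 ...' line into (word, list of floats)."""
    word, coefs = line.split(maxsplit=1)
    return word, [float(x) for x in coefs.split()]


# Made-up lines in the GloVe text layout
toy_lines = ["the 0.1 -0.2 0.3\n", "cat 0.5 0.25 -1.0\n"]
toy_index = dict(parse_glove_line(line) for line in toy_lines)
print(toy_index["cat"])
```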


In [29]:
embeddings_index['baseball']

array([ 8.0381e-01,  4.6716e-01,  5.5460e-01, -5.0325e-01, -9.0828e-01,
       -9.8833e-04,  1.8065e-01, -3.0682e-01, -8.8492e-01, -6.3617e-01,
       -3.7251e-01, -1.1336e+00,  6.4746e-01, -1.3095e-01, -1.9357e-01,
        8.0117e-02,  1.3667e+00,  1.0113e+00,  1.7041e-01,  1.3550e-01,
       -2.6088e-01,  9.5558e-01, -3.7744e-01, -3.2777e-01,  6.7479e-01,
       -8.2864e-02, -5.3688e-01, -1.0528e+00,  2.4914e-01,  9.2037e-01,
       -1.8600e-01,  9.4798e-01, -1.6681e-01,  4.6843e-02, -2.4946e-01,
        2.6076e-02, -1.1478e+00,  4.2764e-01, -8.3345e-01, -8.1160e-02,
        3.9547e-01, -3.4715e-02,  2.8523e-01, -9.5508e-01, -1.5865e-01,
        4.4431e-02,  9.0042e-01, -5.9723e-01,  7.3605e-02, -7.5065e-01,
       -2.2557e-01, -1.4947e-01,  1.0915e-01,  2.0668e-01,  8.1028e-02,
       -1.4774e+00, -6.3596e-02,  4.2345e-01,  1.5685e+00,  1.6096e+00,
       -1.1021e+00,  9.1121e-01, -3.5620e-01, -4.7878e-01,  4.4527e-01,
       -4.1572e-01,  7.0802e-01,  2.1506e-01, -2.0303e-01,  4.63

Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained
vector for the word of index `i` in our `vectorizer`'s vocabulary.

In [30]:
num_tokens = len(vocab)
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for i, word in enumerate(vocab):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the rows for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Converted 18021 words (1980 misses)
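
To make the row-per-token layout concrete, here is a small sketch with plain Python lists: row `i` of the matrix holds the vector for vocabulary entry `i`, and rows for tokens without a pre-trained vector stay at zero. All names and values are toy assumptions.

```python
toy_vocab = ["", "[UNK]", "the", "cat", "zzyzx"]
toy_embeddings = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
dim = 2

# Row i holds the vector of toy_vocab[i]; missing words keep the zero row
toy_matrix = [toy_embeddings.get(word, [0.0] * dim) for word in toy_vocab]


def embed(token_ids, matrix):
    """What an embedding lookup does: pick one row per token id."""
    return [matrix[i] for i in token_ids]


embedded = embed([2, 3, 1, 0], toy_matrix)
print(embedded)
```

This is why the token ids produced by the vectorizer and the rows of the matrix must come from the same vocabulary ordering.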


In [31]:
np.savez('./data/embedding', embedding=embedding_matrix)

In [32]:
embedding_path = sess.upload_data('data/embedding.npz', key_prefix=prefix+'/embedding')

## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.

In [33]:
%%time 

x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

CPU times: user 7.41 s, sys: 6.92 s, total: 14.3 s
Wall time: 14.3 s


In [34]:
np.savez('./data/training', text=x_train, label=y_train)
np.savez('./data/validation', text=x_val, label=y_val)

In [35]:
training_input_path   = sess.upload_data('data/training.npz', key_prefix=prefix+'/training')
validation_input_path = sess.upload_data('data/validation.npz', key_prefix=prefix+'/validation')

print(training_input_path)
print(validation_input_path)

s3://sagemaker-us-east-1-431615879134/keras-text-classification/training/training.npz
s3://sagemaker-us-east-1-431615879134/keras-text-classification/validation/validation.npz


We use categorical crossentropy as our loss since we're doing softmax classification.
Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.
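
The difference between the two losses is only in how the label is encoded. For a single sample, the sparse variant takes the integer class index and returns the negative log of the probability assigned to that class; the dense variant computes the same value from a one-hot vector. A small sketch with made-up probabilities (these helper functions are illustrative, not the Keras implementations):

```python
import math


def sparse_categorical_crossentropy(label, probs):
    """Loss for one sample given an integer label and softmax probabilities."""
    return -math.log(probs[label])


def categorical_crossentropy(one_hot, probs):
    """Same loss, but with a one-hot encoded label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))


probs = [0.1, 0.7, 0.2]  # made-up softmax output over 3 classes
sparse_loss = sparse_categorical_crossentropy(1, probs)
dense_loss = categorical_crossentropy([0, 1, 0], probs)
print(sparse_loss, dense_loss)
```

Using integer labels avoids building a 20-column one-hot array for every sample.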

In [36]:
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='./src/text_classification_keras_tf.py', 
                          role=role,
                          instance_count=1, 
                          instance_type='ml.p3.2xlarge',#instance_type='local_gpu',
                          framework_version='2.1.0', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 20}
                         )
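
The entry point script `./src/text_classification_keras_tf.py` is not shown in this notebook. As a rough sketch of how such a script could pick up its inputs: in script mode, hyperparameters arrive as command-line arguments, and each input channel passed to `fit` (here `training`, `validation`, and `embedding`) is exposed through an `SM_CHANNEL_<NAME>` environment variable. The argument names and fallback paths below are assumptions, not the actual script.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameter from the estimator: hyperparameters={'epochs': 20}
    parser.add_argument("--epochs", type=int, default=20)
    # One directory per input channel; the fallback path is an assumed default
    for channel in ("training", "validation", "embedding"):
        parser.add_argument(
            "--" + channel,
            default=os.environ.get(
                "SM_CHANNEL_" + channel.upper(), "/opt/ml/input/data/" + channel
            ),
        )
    # parse_known_args ignores extra arguments the platform may add
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "3"])
print(args.epochs, args.training, args.validation, args.embedding)
```

Inside such a script, the `.npz` files uploaded earlier could then be read from those directories with `np.load`, using the `text`, `label`, and `embedding` keys they were saved under.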

In [37]:
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path, 'embedding': embedding_path})

2021-01-12 22:54:51 Starting - Starting the training job...
2021-01-12 22:55:15 Starting - Launching requested ML instancesProfilerReport-1610492091: InProgress
.........
2021-01-12 22:56:36 Starting - Preparing the instances for training......
2021-01-12 22:57:43 Downloading - Downloading input data...
2021-01-12 22:58:18 Training - Downloading the training image......
2021-01-12 22:59:18 Training - Training image download completed. Training in progress..
2021-01-12 22:59:20,902 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2021-01-12 22:59:21,379 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training",
        "embedding": "/opt/ml/input/data/embedding",
        "validation": "/opt/ml/input/data/validation"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_

## Deploy and Test the model

In [38]:
tf_predictor = tf_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-------------!

In [39]:
out = tf_predictor.predict(vectorizer([["this message is about computer graphics and 3D modeling"]]).numpy())

In [40]:
out

{'predictions': [[1.74934538e-16,
   0.999999881,
   1.73168131e-12,
   2.88640477e-13,
   3.91689285e-12,
   8.87516283e-10,
   5.77041921e-11,
   1.14515149e-21,
   6.06566683e-19,
   9.25052474e-14,
   6.30532737e-10,
   6.36439593e-15,
   2.19158814e-14,
   6.9012934e-08,
   8.12032247e-11,
   1.64939595e-13,
   7.55520728e-20,
   7.57738946e-16,
   2.43249e-16,
   9.36628585e-17]]}

In [41]:
print(f"Predicted - {class_names[np.argmax(out['predictions'][0])]} with {np.round(max(out['predictions'][0])*100,2)}% confidence")

Predicted - comp.graphics with 100.0% confidence
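
The argmax only reports the single most likely class. When predictions are less clear-cut, it can help to look at the runner-up classes too. A small sketch with made-up probabilities (`top_k` is a hypothetical helper, not part of the SageMaker or Keras APIs):

```python
def top_k(probs, names, k=3):
    """Return the k (name, probability) pairs with the highest probability."""
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy stand-ins for class_names and out['predictions'][0]
toy_names = ["alt.atheism", "comp.graphics", "sci.space", "rec.autos"]
toy_probs = [0.05, 0.80, 0.10, 0.05]

best = top_k(toy_probs, toy_names, k=2)
for name, p in best:
    print(f"{name}: {p:.1%}")
```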


In [42]:
def predict_string(string):
    # Vectorize the raw string locally, then send the token ids to the endpoint
    out = tf_predictor.predict(vectorizer([[string]]).numpy())
    probs = out['predictions'][0]
    print(f"Predicted Class -- {class_names[np.argmax(probs)]} -- with {np.round(max(probs)*100,2)}% confidence")

In [43]:
predict_string("On thursday we went to church for Christmas")

Predicted Class -- soc.religion.christian -- with 99.87% confidence


In [44]:
predict_string("Apollo 13 was the seventh crewed mission in the Apollo space program and the third meant to land on the Moon. The craft was launched from Kennedy Space Center on April 11, 1970, but the lunar landing was aborted after an oxygen tank in the service module failed two days into the mission")

Predicted Class -- sci.space -- with 100.0% confidence


In [45]:
predict_string("The ram was insufficient to run my program on my new apple computer")

Predicted Class -- comp.sys.mac.hardware -- with 93.38% confidence


In [53]:
predict_string("The RAM was insufficient to run my program on my new IBM computer")

Predicted Class -- comp.sys.ibm.pc.hardware -- with 67.2% confidence


In [57]:
predict_string("The out was recorded before the runner left the base")

Predicted Class -- rec.sport.baseball -- with 56.25% confidence


## Clean Up Resources

In [154]:
sess.delete_endpoint(endpoint_name=tf_predictor.endpoint_name)