# Text classification with pre-trained word embeddings (GloVe)

This example trains a text classification model on top of pre-trained GloVe word embeddings.

It uses the Newsgroup20 dataset, a set of about 20,000 message board messages belonging to 20 different topic categories.

In [1]:
# Import required libraries

import numpy as np
import tensorflow as tf
from tensorflow import keras
import os
import pathlib
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import Embedding
from tensorflow.keras import layers

## Downloading the Newsgroup20 dataset

In [2]:
dataset_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

Downloading data from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz


In [3]:
dataset_dir = pathlib.Path(dataset_path).parent / "20_newsgroup"
dir_names = os.listdir(dataset_dir)
print("Number of directories: {} ".format(len(dir_names)))
print("Directory names: {}".format(dir_names))

f_names = os.listdir(dataset_dir / "comp.graphics")
print("Number of files in comp.graphics: {}".format(len(f_names)))
print("Some example filenames: {}".format(f_names[:5]))

Number of directories: 20 
Directory names: ['rec.autos', 'comp.sys.mac.hardware', 'rec.sport.baseball', 'talk.politics.guns', 'rec.sport.hockey', 'comp.sys.ibm.pc.hardware', 'sci.med', 'comp.windows.x', 'rec.motorcycles', 'comp.graphics', 'comp.os.ms-windows.misc', 'talk.religion.misc', 'sci.electronics', 'soc.religion.christian', 'talk.politics.misc', 'alt.atheism', 'sci.crypt', 'misc.forsale', 'sci.space', 'talk.politics.mideast']
Number of files in comp.graphics: 1000
Some example filenames: ['38239', '37962', '39028', '38713', '38262']


Checking the file content.

In [4]:
print(open(dataset_dir / "comp.graphics" / "38987").read())

Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: mabusj@nason110.its.rpi.edu (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <c285m+p@rpi.edu>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: mabusj@rpi.edu
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7

Jasen Mabus
RPI student

	I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to mabusj@rpi.edu.

Thank you in advance,
Jasen Mabus  




The header lines reveal each file's category (for example, through the `Newsgroups:` field), so we drop the first 10 lines of every file, which is where the header sits in the message above.

In [5]:
sample = []
label = []
class_name = []
class_index = 0
for dir_name in sorted(os.listdir(dataset_dir)):
    class_name.append(dir_name)
    dir_path = dataset_dir / dir_name
    f_names = os.listdir(dir_path)
    print("Processing %s, %d files found" % (dir_name, len(f_names)))
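    for f_name in f_names:
        fpath = dir_path / f_name
        # latin-1 maps every byte to a character, so decoding never fails
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        # Drop the first 10 lines, which hold the header fields
        lines = lines[10:]
        content = "\n".join(lines)
        sample.append(content)
        label.append(class_index)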
    class_index += 1

print("Classes: {}".format(class_name))
print("Number of samples: {}".format(len(sample)))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.ha
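
The header-stripping step can be tried on a single message. This is a minimal sketch: `drop_header_lines` and the toy message are made up for illustration, and the function simply mirrors the fixed 10-line cut used in the loop above.

```python
def drop_header_lines(text, num_header_lines=10):
    """Return `text` without its first `num_header_lines` lines."""
    lines = text.split("\n")
    return "\n".join(lines[num_header_lines:])


# Hypothetical message: 10 header lines, a blank line, then a two-line body.
header = ["Newsgroups: comp.graphics"] + ["X-Header-%d: value" % i for i in range(9)]
message = "\n".join(header + ["", "First body line", "Second body line"])

body = drop_header_lines(message)
print(body)
```

A fixed cut assumes the header is 10 lines long. A file with a longer header keeps its remaining header lines in the body, and a file with a shorter one loses the start of its text.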

## Shuffle and split the data into training and validation datasets

In [6]:
# Shuffling the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(sample)
rng = np.random.RandomState(seed)
rng.shuffle(label)

# Extracting a training & validation split
validation_split = 0.2
num_validation_sample = int(validation_split * len(sample))
train_sample = sample[:-num_validation_sample]
val_sample = sample[-num_validation_sample:]
train_label = label[:-num_validation_sample]
val_label = label[-num_validation_sample:]
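
Re-creating the generator with the same seed is what keeps `sample` and `label` aligned: both lists have the same length, so they receive the same permutation. The sketch below checks this on toy data (the `texts` and `labels` lists are made up for illustration) and then applies the same 80/20 slicing.

```python
import numpy as np

# Toy data: each text is paired with the label at the same position.
texts = ["doc0", "doc1", "doc2", "doc3", "doc4"]
labels = [0, 1, 2, 3, 4]
pairs_before = set(zip(texts, labels))

seed = 1337
# A fresh generator with the same seed replays the same permutation.
np.random.RandomState(seed).shuffle(texts)
np.random.RandomState(seed).shuffle(labels)

# Every text still sits next to its original label.
pairs_after = set(zip(texts, labels))
print(pairs_after == pairs_before)

# Hold out the last 20% as the validation set.
num_val = int(0.2 * len(texts))
train_texts, val_texts = texts[:-num_val], texts[-num_val:]
print(len(train_texts), len(val_texts))
```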

## Create a vocabulary index


In [7]:
# Using TextVectorization to index vocabulary present in dataset.

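# Sequences are padded or truncated to 200 tokens; with unpadded kernels and
# pools of size 5, the convolution stack used later needs at least 149 steps.
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)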
text_dataset = tf.data.Dataset.from_tensor_slices(train_sample).batch(128)
vectorizer.adapt(text_dataset)

In [8]:
# Printing top 5 words from the computed vocabulary
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

Let's vectorize a test sentence:

In [9]:
# Vectorizing the sample sentence
output = vectorizer([["the cat sat on the yellow mat"]])
output.numpy()[0, :6]

array([   2, 3811, 1713,   15,    2, 5115])

In [10]:
# Mapping words to indices
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [11]:
test_sen = ["the", "cat", "sat", "on", "the", "yellow", "mat"]
[word_index[w] for w in test_sen]

[2, 3811, 1713, 15, 2, 5115, 6091]

The first six indices match the vectorizer output shown above for the same test sentence.
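
What the vectorizer does with a sentence can be approximated in plain Python. This is only a rough sketch, not the layer's implementation: `toy_voc` and `encode` are hypothetical names, and the real layer's default standardization also strips punctuation.

```python
# Toy vocabulary in the same layout as `vectorizer.get_vocabulary()`:
# index 0 is the padding token and index 1 is the out-of-vocabulary token.
toy_voc = ["", "[UNK]", "the", "cat", "sat", "on", "mat"]
toy_word_index = dict(zip(toy_voc, range(len(toy_voc))))


def encode(sentence, word_index, sequence_length):
    """Map lowercased, whitespace-separated words to indices, then pad or truncate."""
    ids = [word_index.get(w, 1) for w in sentence.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


encoded = encode("The cat sat on the yellow mat", toy_word_index, 10)
print(encoded)
```

Here "yellow" is missing from the toy vocabulary, so it maps to index 1, and the three trailing zeros are padding.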

## Load pre-trained word embeddings

In [12]:
# Download pre-trained GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2020-12-11 02:25:52--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-12-11 02:25:52--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-12-11 02:25:52--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [13]:
# Checking current directory
!pwd

/content


In [14]:
# Checking the files present in current directory
!ls

glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip
glove.6B.200d.txt  glove.6B.50d.txt   sample_data


In [15]:
# Path to the 100-dimensional GloVe vectors
path_to_glove = os.path.join("/content", "glove.6B.100d.txt")

In [16]:
# Making a dict mapping words (strings) to their NumPy vector representation
embedding_index = {}
with open(path_to_glove, encoding="utf-8") as f:
    for line in f:
        # Each line holds a word followed by its space-separated coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embedding_index[word] = coefs

print("Found {} word vectors.".format(len(embedding_index)))

Found 400000 word vectors.
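
The parsing step relies on the GloVe text layout: one word per line, followed by its coefficients. The sketch below runs the same split on two made-up lines held in an in-memory file. The values and the `fake_glove_file` name are illustrative only.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text layout: a word, then its coefficients.
fake_glove_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

toy_embedding_index = {}
for line in fake_glove_file:
    # Split once, so the word is separated from the rest of the line.
    word, coefs = line.split(maxsplit=1)
    toy_embedding_index[word] = np.fromstring(coefs, "f", sep=" ")

print(sorted(toy_embedding_index))
print(toy_embedding_index["cat"].shape, toy_embedding_index["cat"].dtype)
```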


Next, prepare an embedding matrix that can be used in a Keras
`Embedding` layer. It is a plain NumPy matrix in which row `i` holds the pre-trained
vector for the word with index `i` in the `vectorizer`'s vocabulary. Words without a
GloVe vector keep an all-zero row.

In [17]:
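# `voc` already contains the padding ('') and '[UNK]' tokens;
# the + 2 adds two spare rows that stay all zeros.
number_of_tokens = len(voc) + 2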
embedding_dim = 100
hits = 0
misses = 0

# Preparing an embedding matrix
embedding_matrix = np.zeros((number_of_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted {} words ({} misses)".format(hits, misses))


Converted 17984 words (2016 misses)
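
The same loop on toy data makes the hit and miss cases easy to inspect. This is a sketch with made-up values: `toy_word_index` and `toy_embedding_index` stand in for `word_index` and `embedding_index`.

```python
import numpy as np

# Toy stand-ins for `word_index` and `embedding_index` (values are made up).
toy_word_index = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "zzyzx": 4}
toy_embedding_index = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

toy_matrix = np.zeros((len(toy_word_index), 3))
missing = []
for word, i in toy_word_index.items():
    vector = toy_embedding_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # row i holds the vector of the word with index i
    else:
        missing.append(word)  # row i stays all zeros

print(toy_matrix)
print(missing)
```

The padding token, the out-of-vocabulary token and any word absent from the pre-trained vectors all end up as zero rows.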


In [18]:
# Loading the pre-trained word embeddings matrix into an `Embedding` layer.

embedding_layer = Embedding(
    number_of_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
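
Conceptually, a frozen `Embedding` layer acts as a lookup table over the rows of the matrix. The NumPy sketch below imitates that lookup with integer-array indexing. It is not the Keras implementation, and the matrix values are made up.

```python
import numpy as np

# Made-up 4 x 3 embedding matrix: 4 tokens, 3 dimensions per token.
toy_matrix = np.arange(12, dtype="float32").reshape(4, 3)

# A batch of 2 sequences, each 5 token indices long.
token_ids = np.array([[2, 3, 1, 0, 0],
                      [1, 1, 2, 0, 0]])

# Integer-array indexing selects one matrix row per token index.
embedded = toy_matrix[token_ids]
print(embedded.shape)
```

The result has one extra trailing axis, which corresponds to the `(None, None, 100)` output shape of the embedding layer in the model.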

## Build the model



In [19]:
int_sequence_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequence = embedding_layer(int_sequence_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequence)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
pred = layers.Dense(len(class_name), activation="softmax")(x)
model_1 = keras.Model(int_sequence_input, pred)
model_1.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         2000200   
_________________________________________________________________
conv1d (Conv1D)              (None, None, 128)         64128     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 128)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 128)        
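
The summary reports `None` for the sequence axis because the input length is left open. For a concrete length, the size after each layer can be worked out by hand. The sketch below assumes the Keras defaults used in the model (no padding, convolution stride 1, pooling stride equal to the pool size), and the helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1


def max_pool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return (length - pool_size) // pool_size + 1


def trace_lengths(length):
    """Sequence length after each Conv1D / MaxPooling1D layer of the stack."""
    trace = []
    for layer in ["conv", "pool", "conv", "pool", "conv"]:
        if layer == "conv":
            length = conv1d_valid_length(length, 5)
        else:
            length = max_pool1d_length(length, 5)
        trace.append(length)
    return trace


print(trace_lengths(200))
```

Under these formulas a 200-token input shrinks to 196, 39, 35, 7 and finally 3 steps before the global max pooling. Inputs shorter than 149 tokens leave the last convolution with fewer than 5 steps to slide over.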

## Train the model

Convert the list-of-strings data to NumPy arrays of integer indices. 

In [20]:
x_train = vectorizer(np.array([[s] for s in train_sample])).numpy()
x_val = vectorizer(np.array([[s] for s in val_sample])).numpy()

y_train = np.array(train_label)
y_val = np.array(val_label)

In [21]:
model_1.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"])
model_1.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f7f9521d710>
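
The `sparse_categorical_crossentropy` loss works directly on integer labels such as `y_train`: for each sample it takes the negative log of the probability predicted for the true class. A small NumPy sketch with made-up probabilities, which is not the Keras implementation:

```python
import numpy as np

# Made-up softmax outputs for 2 samples over 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
# Integer class labels, as in `y_train`: no one-hot encoding needed.
labels = np.array([0, 2])

# Pick the predicted probability of the true class for each sample.
true_class_probs = probs[np.arange(len(labels)), labels]
loss = -np.log(true_class_probs).mean()

# Accuracy: fraction of samples whose most likely class is the true one.
accuracy = (probs.argmax(axis=1) == labels).mean()
print(float(loss), float(accuracy))
```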

## Export an end-to-end model

Wrap the `vectorizer` and the trained model in a single model that takes raw strings as input, which makes it portable.

In [22]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model_1(x)
end_to_end_model = keras.Model(string_input, preds)

probability = end_to_end_model.predict([["This message is about computer graphics and 3D modeling"]])

class_name[np.argmax(probability[0])]

'comp.graphics'
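
The last step turns a row of class probabilities into a class name. The sketch below repeats it on made-up numbers (`toy_class_names` and `toy_probability` are hypothetical) and also ranks all classes, which can help when inspecting uncertain predictions.

```python
import numpy as np

# Hypothetical class names and one row of predicted probabilities.
toy_class_names = ["alt.atheism", "comp.graphics", "sci.space"]
toy_probability = np.array([[0.05, 0.85, 0.10]])

# Index of the most likely class for the first (and only) input.
best = int(np.argmax(toy_probability[0]))
print(toy_class_names[best], float(toy_probability[0, best]))

# Class names ordered from most to least likely.
ranking = np.argsort(toy_probability[0])[::-1]
ranked_names = [toy_class_names[i] for i in ranking]
print(ranked_names)
```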