
# Using pre-trained word embeddings

## Introduction

In this example, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with the Newsgroup20 dataset, a set of roughly 20,000 message board messages
belonging to 20 different topic categories.

For the pre-trained word embeddings, we'll use
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

In [2]:
import os

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import pathlib
import numpy as np
import tensorflow as tf
import tensorflow.data as tf_data
import keras
from keras import layers

In [3]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

In [4]:
data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of directories: 20
Directory names: ['soc.religion.christian', 'sci.med', 'sci.crypt', 'rec.sport.baseball', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'comp.windows.x', 'comp.sys.mac.hardware', 'talk.religion.misc', 'sci.space', 'talk.politics.mideast', 'comp.graphics', 'sci.electronics', 'misc.forsale', 'talk.politics.misc', 'rec.autos', 'comp.os.ms-windows.misc', 'rec.motorcycles', 'alt.atheism', 'talk.politics.guns']
Number of files in comp.graphics: 1000
Some example filenames: ['38965', '39050', '38626', '38561', '38730']


In [5]:
os.listdir(data_dir / "comp.windows.x")[:5]

['67378', '67557', '68176', '68213', '66445']

In [6]:
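# read_text opens and closes the file for us; latin-1 matches the loading loop below.
print((data_dir / "comp.windows.x" / "67260").read_text(encoding="latin-1"))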

Xref: cantaloupe.srv.cs.cmu.edu comp.windows.x.motif:16776 comp.windows.x:67260 comp.human-factors:4654 comp.windows.ms:36407 comp.windows.open-look:8416
Newsgroups: comp.windows.x.motif,comp.windows.x,comp.human-factors,comp.windows.ms,comp.windows.open-look
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!zaphod.mps.ohio-state.edu!rphroy!kocrsv01!c2xjfa
From: c2xjfa@kocrsv01.delcoelect.com (James F Allman III)
Subject: Re: GUI Study
Message-ID: <1993Apr23.201024.13895@kocrsv01.delcoelect.com>
Originator: c2xjfa@koptsw18
Sender: news@kocrsv01.delcoelect.com (Usenet News Account)
Organization: Delco Electronics Corp.
References: <1993Apr2.203400.15357@kocrsv01.delcoelect.com> <1993Apr7.144905.9827@thunder.mcrcim.mcgill.edu> <1993Apr13.104408.24613@mnemosyne.cs.du.edu> <1993Apr23.031744.19111@mercury.unt.edu>
Distribution: na
Date: Fri, 23 Apr 1993 20:10:24 GMT
Lines: 33


In article <1993Apr23.031744.19111@mercury.unt.edu>, seth@ponder.csci.unt.edu (Seth Buf

As you can see, there are header lines that leak the file's category, either
explicitly (the first lines literally list the newsgroup names), or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers by dropping the first 10 lines of each file:

In [7]:
samples = []
labels = []
class_names = []

# Sort directory names so that class indices are reproducible across runs.
for class_index, dirname in enumerate(sorted(os.listdir(data_dir))):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print(f"processing {dirname}, {len(fnames)} files found")
    for fname in fnames:
        fpath = dirpath / fname
        # latin-1 maps every byte to a character, so decoding never fails.
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        # Drop the first 10 lines, which hold most of the header fields.
        lines = content.split("\n")[10:]
        samples.append("\n".join(lines))
        labels.append(class_index)

print("Classes:", class_names)
print("Number of samples:", len(samples))

processing alt.atheism, 1000 files found
processing comp.graphics, 1000 files found
processing comp.os.ms-windows.misc, 1000 files found
processing comp.sys.ibm.pc.hardware, 1000 files found
processing comp.sys.mac.hardware, 1000 files found
processing comp.windows.x, 1000 files found
processing misc.forsale, 1000 files found
processing rec.autos, 1000 files found
processing rec.motorcycles, 1000 files found
processing rec.sport.baseball, 1000 files found
processing rec.sport.hockey, 1000 files found
processing sci.crypt, 1000 files found
processing sci.electronics, 1000 files found
processing sci.med, 1000 files found
processing sci.space, 1000 files found
processing soc.religion.christian, 997 files found
processing talk.politics.guns, 1000 files found
processing talk.politics.mideast, 1000 files found
processing talk.politics.misc, 1000 files found
processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.ha
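
The loop above removes a fixed number of lines, so a message with a longer header block (the sample printed earlier has 13 header lines) keeps a few header fields. As a hedged alternative, and assuming the files follow the usual Usenet convention of a blank line between headers and body, the headers could be cut at the first blank line instead. `strip_headers` is a hypothetical helper, not part of the original notebook:

```python
def strip_headers(message: str) -> str:
    """Return the text after the first blank line (assumed header/body separator)."""
    header, separator, body = message.partition("\n\n")
    # If no blank line is found, keep the whole message rather than dropping it.
    return body if separator else message


# A made-up message in the same layout as the dataset files.
raw = (
    "Newsgroups: comp.windows.x\n"
    "Subject: Re: GUI Study\n"
    "Organization: Delco Electronics Corp.\n"
    "Lines: 33\n"
    "\n"
    "In article <...>, someone wrote:\n"
    "Body text here.\n"
)
cleaned = strip_headers(raw)
print(cleaned)
```

Whether this works better than the fixed slice depends on how consistently the files follow that convention, which is worth checking on a few samples first.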

There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.
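
One way to back up the "balanced" claim is to count the labels per class and compare the largest class to the smallest. The sketch below uses a small hypothetical `toy_labels` list in place of the real `labels`:

```python
from collections import Counter

# Hypothetical labels standing in for the `labels` list built above:
# two full classes and one class that is three files short.
toy_labels = [0] * 1000 + [1] * 997 + [2] * 1000

counts = Counter(toy_labels)
smallest, largest = min(counts.values()), max(counts.values())
imbalance_ratio = largest / smallest

print(counts)
print(f"largest/smallest class ratio: {imbalance_ratio:.4f}")
```

A ratio this close to 1 means plain accuracy remains a reasonable metric.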



## Shuffle and split the data into training & validation sets

In [8]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
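
The cell above keeps samples and labels aligned by shuffling each list with a freshly seeded generator, which works because both lists have the same length. An equivalent approach, sketched here on toy data with hypothetical names, draws one permutation and applies it to both lists:

```python
import numpy as np

# Each toy sample ends with its own label so alignment is easy to check.
toy_samples = ["a0", "b1", "c2", "d3", "e4"]
toy_labels = [0, 1, 2, 3, 4]

rng = np.random.RandomState(1337)
order = rng.permutation(len(toy_samples))
shuffled_samples = [toy_samples[i] for i in order]
shuffled_labels = [toy_labels[i] for i in order]

# Hold out the last 20% for validation (at least one element in this toy case).
num_val = max(1, int(0.2 * len(shuffled_samples)))
train_s, val_s = shuffled_samples[:-num_val], shuffled_samples[-num_val:]
train_l, val_l = shuffled_labels[:-num_val], shuffled_labels[-num_val:]

print(list(zip(shuffled_samples, shuffled_labels)))
```

Using a single permutation makes the pairing explicit, so it does not depend on two generators staying in sync.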

## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.

In [9]:
vectorizer = layers.TextVectorization(max_tokens = 20000, output_sequence_length = 200)
text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

In [10]:
vectorizer(["the cat is a black or dog is white"]).numpy()[0, :9]

array([   2, 3567,    8,    5,  570,   22, 1811,    8,  627])

As you can see, "the" gets represented as "2". Why not 0, given that "the" is the most
frequent word in the dataset? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.
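
To make the indexing scheme concrete, here is a simplified pure-Python stand-in for what the layer does. It is a sketch, not the Keras implementation: it only lowercases and splits on whitespace, and `build_vocab` and `vectorize` are hypothetical helpers.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Index tokens by descending frequency; 0 is padding, 1 is out-of-vocabulary."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, word_to_index, length):
    """Map tokens to indices, then truncate or right-pad with 0 to `length`."""
    ids = [word_to_index.get(tok, 1) for tok in text.lower().split()]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))


toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_vocab = build_vocab(toy_texts, max_tokens=4)
toy_index = {tok: i for i, tok in enumerate(toy_vocab)}

print(toy_vocab)                                       # ['', '[UNK]', 'the', 'cat']
print(vectorize("the cat flew", toy_index, length=5))  # [2, 3, 1, 0, 0]
```

The unknown word "flew" maps to 1 and the two unused positions are filled with 0, mirroring the reserved indices described above.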

Here's a dict mapping words to their indices:

In [11]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [12]:
text = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[i] for i in text]

[2, 3567, 1819, 15, 2, 5818]

## Load pre-trained word embeddings



Let's download pre-trained GloVe embeddings (an 822 MB zip file).

You'll need to run the following commands:

In [13]:
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-12-06 05:38:25--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’


2023-12-06 05:41:06 (5.11 MB/s) - ‘glove.6B.zip.1’ saved [862182613/862182613]

replace glove.6B.50d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:

In [14]:
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
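# Set the encoding explicitly so the non-ASCII tokens in the file load on any platform.
with open(path_to_glove_file, encoding="utf-8") as f: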
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
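
Each line of the GloVe text file is a token followed by space-separated floats. The sketch below parses two made-up lines in that layout from an in-memory buffer, so the parsing logic can be checked without the 822 MB download. The words and values are invented for illustration.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text layout: a token, then space-separated floats.
fake_glove_file = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")

toy_embeddings = {}
for line in fake_glove_file:
    # Split off the token; everything after the first whitespace is the vector.
    word, coefs = line.split(maxsplit=1)
    toy_embeddings[word] = np.asarray(
        [float(c) for c in coefs.split()], dtype="float32"
    )

print(toy_embeddings["cat"])
```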


Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained
vector for the word of index `i` in our `vectorizer`'s vocabulary.

In [15]:
# `voc` already contains the padding and OOV tokens, so the 2 extra rows stay unused.
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index keep their all-zeros row.
        # This includes the padding token ("") and the OOV token ("[UNK]").
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 17971 words (2029 misses)
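
With 17,971 hits out of 20,000 vocabulary entries, about 89.9% of the vocabulary has a pre-trained vector. The sketch below repeats the same construction on a toy vocabulary to show what a miss looks like in the matrix. All names and values are hypothetical.

```python
import numpy as np

toy_vocab = ["", "[UNK]", "the", "cat", "zyxxy"]
toy_embeddings = {
    "the": np.array([0.1, 0.2, 0.3]),
    "cat": np.array([0.4, 0.5, 0.6]),
}

toy_matrix = np.zeros((len(toy_vocab), 3))
toy_hits, toy_misses = 0, 0
for i, word in enumerate(toy_vocab):
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector
        toy_hits += 1
    else:
        # Row i keeps its all-zeros initialization.
        toy_misses += 1

print(toy_matrix)
print(f"coverage: {toy_hits / len(toy_vocab):.0%}")
```

Rows 0, 1 and 4 stay at zero, so padding, unknown tokens and words missing from the embedding file all look identical to the downstream layers.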


Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).

In [16]:
embedding_layer = layers.Embedding(num_tokens, embedding_dim, trainable = False)

embedding_layer.build((1,))
embedding_layer.set_weights([embedding_matrix])
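
Conceptually, a frozen `Embedding` layer is a row lookup into its weight matrix. The sketch below uses plain NumPy indexing as a stand-in for the layer, with a made-up 4 x 2 matrix:

```python
import numpy as np

toy_matrix = np.array(
    [
        [0.0, 0.0],  # index 0: padding
        [0.0, 0.0],  # index 1: out-of-vocabulary
        [0.1, 0.2],  # index 2
        [0.3, 0.4],  # index 3
    ]
)
token_ids = np.array([[2, 3, 1, 0]])  # one sequence, right-padded with 0

# Integer-array indexing returns one matrix row per token id.
embedded = toy_matrix[token_ids]
print(embedded.shape)  # (1, 4, 2): (batch, sequence length, embedding dim)
```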

In [17]:
int_sequences_input = keras.Input(shape=(None,), dtype="int32")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         2000200   
                                                                 
 conv1d (Conv1D)             (None, None, 128)         64128     
                                                                 
 max_pooling1d (MaxPooling1  (None, None, 128)         0         
 D)                                                              
                                                                 
 conv1d_1 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_1 (MaxPoolin  (None, None, 128)         0         
 g1D)                                                        
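
The parameter counts in the summary can be reproduced by hand, and the same arithmetic shows how a 200-token input shrinks through the stack. The sketch below assumes the Keras defaults of `"valid"` padding and a pooling stride equal to the pool size. `conv1d_params` is a hypothetical helper.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


embedding_params = 20002 * 100  # num_tokens * embedding_dim
conv1_params = conv1d_params(5, 100, 128)
conv2_params = conv1d_params(5, 128, 128)
print(embedding_params, conv1_params, conv2_params)  # 2000200 64128 82048

# Sequence length after each layer for a 200-token input.
length = 200
trace = [length]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    # A width-5 "valid" convolution drops 4 steps; pooling by 5 floor-divides.
    length = length - 5 + 1 if layer == "conv" else length // 5
    trace.append(length)
print(trace)  # [200, 196, 39, 35, 7, 3]
```

Only three time steps remain before `GlobalMaxPooling1D`, which then reduces them to a single 128-dimensional vector.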

## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.

In [None]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

We use categorical crossentropy as our loss since we're doing softmax classification.
More precisely, we use `sparse_categorical_crossentropy` since our labels are integer class
indices rather than one-hot vectors.
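
The two loss variants compute the same value and differ only in how the target is encoded. A minimal single-sample sketch, with a made-up probability vector:

```python
import math

probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for one sample
label = 0                # integer class index

# Sparse form: pick the predicted probability of the true class directly.
sparse_loss = -math.log(probs[label])

# Dense form: the same quantity written with a one-hot target vector.
one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

print(sparse_loss, dense_loss)
```

The sparse form avoids materializing a 20-dimensional one-hot vector for every sample.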

In [None]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

## Export an end-to-end model

Now, we may want to export a `Model` object that takes as input a string of arbitrary
length, rather than a sequence of indices. It would make the model much more portable,
since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is actually a Keras layer, so it's simple:

In [None]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model(
    keras.ops.convert_to_tensor(
        [["this message is about computer graphics and 3D modeling"]]
    )
)

print(class_names[np.argmax(probabilities[0])])
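
Beyond the single best class, it can be useful to look at the top few predictions. The sketch below ranks a hypothetical probability vector. With the real model, `probabilities[0]` converted to a NumPy array and the full `class_names` list would take the place of the toy values.

```python
import numpy as np

toy_class_names = ["alt.atheism", "comp.graphics", "sci.space", "rec.autos"]
toy_probabilities = np.array([0.05, 0.70, 0.15, 0.10])  # hypothetical model output

top_k = 3
# argsort is ascending, so reverse it and keep the first k indices.
top_indices = np.argsort(toy_probabilities)[::-1][:top_k]
for i in top_indices:
    print(f"{toy_class_names[i]}: {toy_probabilities[i]:.2f}")
```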