<a href="https://colab.research.google.com/github/roulupen/nlp-examples/blob/main/Using_Pre_trained_embedding_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Guide to Using Pre-trained Word Embeddings in Natural Language Processing**

Reference: https://blog.paperspace.com/pre-trained-word-embeddings-natural-language-processing/

In [1]:
!wget https://raw.githubusercontent.com/roulupen/nlp-examples/main/combined_data.csv

--2024-02-10 10:34:26--  https://raw.githubusercontent.com/roulupen/nlp-examples/main/combined_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127831 (125K) [text/plain]
Saving to: ‘combined_data.csv.1’


2024-02-10 10:34:26 (5.78 MB/s) - ‘combined_data.csv.1’ saved [127831/127831]



In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping

In [3]:
df = pd.read_csv('./combined_data.csv')

In [4]:
df.shape

(1992, 3)

In [5]:
df.head()

|   | Unnamed: 0 | text | sentiment |
|---|---|---|---|
| 0 | 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | 1 | Good case Excellent value. | 1 |
| 2 | 2 | Great for the jawbone. | 1 |
| 3 | 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | 4 | The mic is great. | 1 |


In [6]:
X = df['text']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [7]:
X_train.shape, X_test.shape

((1593,), (399,))

# **Data preprocessing**

Since this is text data, there are several things you have to do to clean it. This includes:
*   Converting all sentences to lowercase
*   Removing all quotation marks
*   Representing all words in some numerical form
*   Removing special characters such as @ and %

All the above can be achieved in TensorFlow using Tokenizer. The class expects a couple of parameters:



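*   ***num_words***: the maximum number of words to keep, ranked by frequency. The word index still records every word seen during fitting, but only the most frequent num_words - 1 words are used when text is converted to sequences
*   ***oov_token***: the token used to represent words that are not found in the word dictionary. This usually happens when processing data the Tokenizer was not fitted on, such as the test set. When an oov_token is set, it is assigned index 1 ("oov" stands for "out of vocabulary")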

The ***fit_on_texts*** function is used to fit the Tokenizer on the training set once it has been instantiated with the preferred parameters.



In [8]:
vocab_size = 10000
oov_token = "<OOV>"
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(X_train)
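
To make the fitting step less of a black box, here is a minimal pure-Python sketch of roughly what building the word index involves: lowercasing, dropping punctuation, counting words, and ranking them by frequency. The helper `build_word_index` and the sample sentences are hypothetical, and the cleaning rule only approximates the Tokenizer's default filters.

```python
import re
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: approximate a frequency-ranked word index."""
    counts = Counter()
    for text in texts:
        # Lowercase, then replace anything that is not a letter, digit,
        # apostrophe, or whitespace with a space before splitting.
        cleaned = re.sub(r"[^a-z0-9'\s]", " ", text.lower())
        counts.update(cleaned.split())
    # Index 0 is left unused (kept for padding); the OOV token takes index 1.
    word_index = {oov_token: 1}
    # The most frequent words receive the smallest indices.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

sample_texts = ["The mic is great.", "Great for the jawbone.", "The case is good!"]
toy_word_index = build_word_index(sample_texts)
print(toy_word_index)
```

In this toy run "the" is the most frequent word, so it receives index 2, directly after the OOV token.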

The ***word_index*** can be used to show the mapping of the words to numbers.

In [9]:
word_index = tokenizer.word_index

In [10]:
word_index

{'<OOV>': 1,
 'the': 2,
 'and': 3,
 'i': 4,
 'a': 5,
 'to': 6,
 'it': 7,
 'is': 8,
 'was': 9,
 'this': 10,
 'of': 11,
 'not': 12,
 'for': 13,
 'my': 14,
 'in': 15,
 'very': 16,
 'with': 17,
 'great': 18,
 'phone': 19,
 'good': 20,
 'that': 21,
 'on': 22,
 'you': 23,
 'have': 24,
 'food': 25,
 'had': 26,
 'service': 27,
 'but': 28,
 'are': 29,
 'place': 30,
 'be': 31,
 'so': 32,
 'as': 33,
 'we': 34,
 'at': 35,
 'all': 36,
 'like': 37,
 'time': 38,
 'quality': 39,
 'back': 40,
 'one': 41,
 'they': 42,
 'were': 43,
 'from': 44,
 'if': 45,
 'would': 46,
 'product': 47,
 'really': 48,
 "don't": 49,
 'well': 50,
 'here': 51,
 'your': 52,
 'also': 53,
 'has': 54,
 'no': 55,
 'will': 56,
 'go': 57,
 'out': 58,
 'battery': 59,
 'just': 60,
 'me': 61,
 'only': 62,
 'ever': 63,
 "it's": 64,
 'works': 65,
 'get': 66,
 'there': 67,
 'an': 68,
 'best': 69,
 'nice': 70,
 'or': 71,
 'up': 72,
 "i've": 73,
 'headset': 74,
 'use': 75,
 'did': 76,
 'our': 77,
 'sound': 78,
 "i'm": 79,
 'love': 80,
 'aft

# **Converting text to sequences**

The next step is to represent each sentence as a sequence of numbers. This can be done using the **texts_to_sequences** function.

In [11]:
X_train_sequences = tokenizer.texts_to_sequences(X_train)

In [12]:
X_train_sequences[0:5]

[[369, 261, 29, 322, 60, 184, 114],
 [73, 26, 93, 12, 62, 44, 1253, 1254, 632, 1255, 28, 82, 44, 1256, 370],
 [56, 97, 63, 57, 40, 3, 24, 430, 171, 152, 83, 26, 633],
 [4, 9, 32, 634],
 [28, 7, 115, 66, 93, 164, 3, 510, 85, 108, 19, 73, 26, 194]]

Let's do the same for the test set. When you check a sample of the sequence you can see that **words that are not in the vocabulary are represented by 1**.

In [13]:
X_test_sequences = tokenizer.texts_to_sequences(X_test)

In [14]:
X_test_sequences[0:5]

[[36, 7, 303, 9, 41, 1038, 44, 103, 2095, 1, 1208, 2, 2582, 1, 3, 7, 9, 1, 4,
  143, 12, 210, 3, 4, 143, 12, 1],
 [79, 275, 7, 17, 68, 1, 1, 17, 116, 3, 7, 386, 184],
 [4, 46, 87, 7],
 [627, 1, 501],
 [69, 644, 15, 376, 105, 175]]
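
The 1s in the output above come from the out-of-vocabulary rule. The following sketch shows that rule in isolation, under simplifying assumptions: the helper `to_sequence` and the tiny word index are hypothetical, and the text is split on whitespace only.

```python
def to_sequence(text, word_index, num_words=None, oov_index=1):
    """Hypothetical helper: map each word to its index, or to oov_index."""
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        # Unknown words, and words ranked at or beyond num_words, become OOV.
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence

toy_word_index = {"<OOV>": 1, "the": 2, "is": 3, "great": 4, "mic": 5}

known = to_sequence("the mic is great", toy_word_index)
unseen = to_sequence("the speaker is great", toy_word_index)
capped = to_sequence("the mic is great", toy_word_index, num_words=5)
print(known)   # every word is in the index
print(unseen)  # "speaker" was never seen, so it maps to 1
print(capped)  # "mic" has index 5, which is cut off by num_words=5
```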

# **Padding the sequences**

At the moment, the sequences have different lengths. A model expects every sequence in a batch to have the same length, so you have to bring all sequences to one common length. This is done by padding the sequences: longer sequences are truncated, while shorter ones are padded with zeros. You therefore have to declare the truncation and padding type.

Let's start by defining the ***maximum length of each sequence, the padding type, and the truncation type***. A padding and truncation type of "post" means that these operations will take place at the end of the sequence.

In [15]:
max_length = 100
padding_type='post'
truncation_type='post'

In [16]:
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_length, padding=padding_type, truncating=truncation_type)
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding=padding_type, truncating=truncation_type)

In [17]:
X_train_padded.shape, X_test_padded.shape

((1593, 100), (399, 100))
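
As a rough illustration of what "post" padding and "post" truncation mean, the sketch below does the same thing on plain Python lists. The helper `pad_post` is hypothetical and a much smaller maximum length is used so the effect is easy to see.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: 'post' padding and 'post' truncation."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])                   # cut extra items off the end
        trimmed += [value] * (maxlen - len(trimmed))   # fill the end with zeros
        padded.append(trimmed)
    return padded

toy_sequences = [[4, 9, 32, 634], [56, 97, 63, 57, 40, 3, 24]]
toy_padded = pad_post(toy_sequences, maxlen=5)
print(toy_padded)
```

The short sequence gains one trailing zero, and the long one loses its last two items.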

# **Using GloVe word embeddings**

TensorFlow enables you to train word embeddings. However, this process not only requires a lot of data but can also be time- and resource-intensive. To tackle these challenges you can use pre-trained word embeddings. Let's illustrate how to do this using GloVe (Global Vectors) word embeddings by Stanford. These embeddings are learned from word co-occurrence statistics over a large corpus, so words that are used in similar contexts end up close to each other in the same vector space. For sentiment data, this means that negative words would tend to be clustered close to each other, and so would positive ones.
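
"Close in vector space" is usually measured with cosine similarity. The sketch below computes it for three made-up 3-dimensional vectors; the numbers are invented purely for illustration and are not real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only; these are NOT GloVe values.
toy_vectors = {
    "phone":   [0.9, 0.8, 0.1],
    "headset": [0.8, 0.9, 0.2],
    "pizza":   [-0.7, -0.9, 0.1],
}

sim_related = cosine_similarity(toy_vectors["phone"], toy_vectors["headset"])
sim_unrelated = cosine_similarity(toy_vectors["phone"], toy_vectors["pizza"])
print(sim_related, sim_unrelated)
```

Once the real vectors are loaded, the same function can be applied to any two entries to check how close they actually are.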

The first step is to obtain the word embeddings and load them into a dictionary. After that, you'll need to create an embedding matrix with one row for each word in the training vocabulary. Let's start by downloading the GloVe word embeddings.

In [18]:
!wget http://nlp.stanford.edu/data/glove.6B.zip -O ./glove.6B.zip
!unzip ./glove.6B.zip

--2024-02-10 10:34:34--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-02-10 10:34:34--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-02-10 10:34:34--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘./glove.6B.zip’


2

*Next, create that dictionary with those embeddings. Let's work with the glove.6B.100d.txt embeddings. **The 100 in the name is the dimension of each word vector.** It happens to equal the maximum length chosen for the sequences, but the two values are independent.*

In [19]:
embeddings_index = {}
# Each line holds a word followed by its vector components, separated by spaces.
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
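
The loading loop relies on the file layout: one word per line, followed by its vector components separated by spaces. The sketch below parses two made-up lines in that layout from an in-memory file; the values are invented and shortened to three components.

```python
import io

# Two made-up lines in the "word v1 v2 ... vN" layout (not real GloVe values).
fake_glove_file = io.StringIO("shop 0.30 -0.14 -0.80\nmic 0.12 0.05 0.44\n")

toy_embeddings = {}
for line in fake_glove_file:
    # The first field is the word; everything after it is the vector.
    word, *values = line.split()
    toy_embeddings[word] = [float(v) for v in values]

print(toy_embeddings["shop"])
```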


The next step is to create a word embedding matrix with one row for each word in the word index that you obtained earlier. If a word doesn't have an embedding in GloVe, it will be represented by a row of zeros.

In [20]:
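embedding_dim = 100  # length of each vector in glove.6B.100d.txt (not the sequence length)
# Row i holds the vector for the word with index i; row 0 is kept for padding.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # Words missing from the GloVe index keep their all-zero row.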
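
The shape of the matrix is easier to see on a toy example. In the sketch below, the helper `build_embedding_matrix`, the three-word index, and the 2-dimensional vectors are all hypothetical; plain lists stand in for the NumPy array.

```python
def build_embedding_matrix(word_index, embeddings, dim):
    """Hypothetical helper: one row per index, zeros where no vector exists."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

toy_word_index = {"<OOV>": 1, "the": 2, "mic": 3}
toy_embeddings = {"the": [0.1, 0.2], "mic": [0.3, 0.4]}  # made-up values

toy_matrix = build_embedding_matrix(toy_word_index, toy_embeddings, dim=2)
for row in toy_matrix:
    print(row)
```

The result has vocabulary size + 1 rows and one column per embedding dimension. Rows 0 (padding) and 1 (the OOV token) stay at zero because neither has a pre-trained vector.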

Here is what the word embedding for the word "shop" looks like.


In [21]:
embeddings_index.get('shop')

array([ 3.0426e-01, -1.4191e-01, -7.9738e-01, -3.5484e-01,  3.0333e-01,
        4.3690e-01, -9.8706e-02,  6.9080e-01,  6.9362e-01,  1.8528e-01,
        1.0648e-01, -4.5209e-01,  8.7568e-01,  1.1414e-01, -2.8514e-01,
        6.0731e-01,  2.7596e-01,  2.3698e-01, -7.1692e-01,  1.6804e-01,
        4.3669e-01,  4.1931e-01,  2.1568e-01, -1.2316e+00,  3.7208e-01,
       -9.0922e-02, -3.8767e-01, -7.0817e-01, -2.4242e-01, -7.2018e-02,
       -3.8969e-01,  5.2464e-01,  2.1317e-01,  8.8327e-02,  6.6017e-04,
        6.7755e-01, -3.3464e-01, -6.1269e-01,  8.2305e-01, -1.4450e+00,
        8.5966e-01, -4.6323e-01, -1.3172e-02, -8.1801e-01,  1.7294e-02,
        1.7025e-01, -6.3946e-01,  4.8516e-01,  6.1706e-01, -3.5333e-01,
       -1.7953e-01,  4.8890e-03, -4.7809e-01,  5.8311e-01, -4.2821e-01,
       -1.7160e+00, -1.3190e+00,  9.0167e-02,  1.3612e+00,  2.2214e-01,
        2.1325e-01,  1.5207e-01,  2.9252e-01,  5.7116e-01, -2.3654e-01,
       -1.4311e-01,  1.2564e+00, -1.6377e-01,  6.9895e-02, -3.28

# **Creating the Keras embedding layer**
---
The next step is to use the embedding matrix you built above as the weights of a Keras embedding layer. You also have to set the trainable parameter of this layer to False so that it is not trained. If the layer were trainable, the pre-trained vectors would be updated during training and could drift away from their GloVe values, which on a small dataset like this one can lead to overfitting. There are also a couple of other things to note:

* The Embedding layer takes the **first argument as the size of the vocabulary**. 1 is added because 0 is usually reserved for padding
* The **input_length** is the length of the input sequences
* The **output_dim** is the dimension of the dense embedding


In [22]:
embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=embedding_matrix.shape[1],  # GloVe vector size (100)
                            weights=[embedding_matrix], input_length=max_length,
                            trainable=False)
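
Conceptually, a frozen embedding layer is a row lookup: each index in a padded sequence selects one row of the embedding matrix. The sketch below shows this with a made-up 4-row matrix and a sequence of length 5, which also makes the difference between the input length and the output dimension visible.

```python
# Made-up embedding matrix: 4 rows (indices 0..3), 2 values per row.
toy_matrix = [
    [0.0, 0.0],  # index 0: padding
    [0.0, 0.0],  # index 1: OOV token
    [0.1, 0.2],  # index 2
    [0.3, 0.4],  # index 3
]

padded_sequence = [2, 3, 1, 0, 0]  # input length is 5

# Look up one row per position in the sequence.
embedded = [toy_matrix[i] for i in padded_sequence]
print(len(embedded), len(embedded[0]))  # sequence length, embedding dimension
```

In the notebook both numbers happen to be 100, but one is the sequence length and the other is the size of each GloVe vector.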

# **Creating the TensorFlow model**
---
The next step is to use the embedding layer in a Keras model. Let's define the model as follows:

* The embedding layer as the first layer
* Two Bidirectional LSTM layers to ensure that information flows in both directions
* A fully connected (Dense) layer, and
* A final layer responsible for the final output



In [23]:
model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(150, return_sequences=True)),
    Bidirectional(LSTM(150)),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')
])
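
To get a feel for the size of this model, the trainable parameters can be counted by hand. The sketch below assumes the default LSTM and Dense configurations with bias terms and a 100-dimensional embedding; the helper names are hypothetical, and the totals are meant to be compared against what `model.summary()` reports.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair, plus one bias per output unit.
    return input_dim * units + units

embedding_dim, lstm_units = 100, 150

# A Bidirectional wrapper runs two LSTMs, so the count doubles,
# and its output is twice as wide as a single LSTM's output.
bilstm_1 = 2 * lstm_params(embedding_dim, lstm_units)
bilstm_2 = 2 * lstm_params(2 * lstm_units, lstm_units)
dense_1 = dense_params(2 * lstm_units, 128)
dense_2 = dense_params(128, 1)

print(bilstm_1, bilstm_2, dense_1, dense_2)
print("trainable total:", bilstm_1 + bilstm_2 + dense_1 + dense_2)
```

The frozen embedding layer is left out of the total because its weights are not trained.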

In [24]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
callbacks = [EarlyStopping(patience = 10)]

In [26]:
num_epochs = 600
history = model.fit(X_train_padded, y_train, epochs=num_epochs,
                    validation_data=(X_test_padded, y_test), callbacks=callbacks)

Epoch 1/600
Epoch 2/600
Epoch 3/600
Epoch 4/600
Epoch 5/600
Epoch 6/600
Epoch 7/600
Epoch 8/600
Epoch 9/600
Epoch 10/600
Epoch 11/600
Epoch 12/600
Epoch 13/600
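
Training stops long before epoch 600 because of the EarlyStopping callback. The sketch below is a simplified model of the patience rule, assuming the monitored value is the validation loss and any decrease counts as an improvement; the helper `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch after which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, wait = loss, 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_losses = [0.60, 0.52, 0.48, 0.50, 0.49, 0.51, 0.55]
stop_at = stopping_epoch(made_up_losses, patience=3)
print(stop_at)
```

Under the same rule, a run with patience 10 that ends after epoch 13 would be consistent with the validation loss last improving at epoch 3, although the log above does not show the loss values.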


In [27]:
loss, accuracy = model.evaluate(X_test_padded, y_test)
print('Test accuracy :', accuracy)

Test accuracy : 0.7769423723220825
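
For a sigmoid output, accuracy is computed by thresholding each predicted probability and comparing it with the label. The sketch below does this by hand, assuming a threshold of 0.5; the helper `binary_accuracy`, the labels, and the probabilities are made up.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

made_up_labels = [1, 0, 1, 1]
made_up_probs = [0.91, 0.20, 0.35, 0.77]
toy_accuracy = binary_accuracy(made_up_labels, made_up_probs)
print(toy_accuracy)

# The reported test accuracy corresponds to 310 correct out of 399 test examples.
print(310 / 399)
```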
