<a href="https://colab.research.google.com/github/shamim-hussain/sequence_models_and_word_embeddings/blob/main/sequence_models_and_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar.gz

In [2]:
from pathlib import Path

base_path = Path('aclImdb')

train_pos_txt = list(f.read_text(encoding="utf-8") for f in (base_path/'train'/'pos').glob('*.txt'))
train_neg_txt = list(f.read_text(encoding="utf-8") for f in (base_path/'train'/'neg').glob('*.txt'))
test_pos_txt = list(f.read_text(encoding="utf-8") for f in (base_path/'test'/'pos').glob('*.txt'))
test_neg_txt = list(f.read_text(encoding="utf-8") for f in (base_path/'test'/'neg').glob('*.txt'))

print('Positive sample 0:')
print(train_pos_txt[0])
print()
print('Negative sample 0:')
print(train_neg_txt[0])

Positive sample 0:
This movie is perfect for all the romantics in the world. John Ritter has never been better and has the best line in the movie! "Sam" hits close to home, is lovely to look at and so much fun to play along with. Ben Gazzara was an excellent cast and easy to fall in love with. I'm sure I've met Arthur in my travels somewhere. All around, an excellent choice to pick up any evening.!:-)

Negative sample 0:
There are movies like "Plan 9" that are so bad they have a charm about them, there are some like "Waterworld" that have the same inexplicable draw as a car accident, and there are some like "Desperate living" that you hate to admit you love. Cowgirls have none of these redemptions. The cast assembled has enough talent to make almost any plot watchable, and from what I've been told, the book is enjoyable.<br /><br />How then could this movie be so intolerably bad? To begin with, it seems the director brought together a cast of names with no other tie than what will brin

In [3]:
import numpy as np

X_train_txt = train_pos_txt + train_neg_txt
Y_train = np.array([1] * len(train_pos_txt) + [0] * len(train_neg_txt))

X_test_txt = test_pos_txt + test_neg_txt
Y_test = np.array([1] * len(test_pos_txt) + [0] * len(test_neg_txt))

In [4]:
!python -m spacy download en_core_web_sm

import spacy
import re
nlp = spacy.load('en_core_web_sm')
clean_regex = re.compile('<.*?>')  # Matches HTML tags such as <br />

def text_data_cleaning(sentence):
    """Strip HTML tags, then return the lowercased lemma of every token."""
    sentence = clean_regex.sub(' ', sentence)
    doc = nlp(sentence)

    tokens = []
    for token in doc:
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            # spaCy 2.x lemmatizes pronouns to "-PRON-"; keep the surface form instead.
            temp = token.lower_
        tokens.append(temp)
    return tokens


Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 10.9 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [5]:
print(text_data_cleaning(X_train_txt[0]))

['this', 'movie', 'be', 'perfect', 'for', 'all', 'the', 'romantic', 'in', 'the', 'world', '.', 'john', 'ritter', 'have', 'never', 'be', 'well', 'and', 'have', 'the', 'good', 'line', 'in', 'the', 'movie', '!', '"', 'sam', '"', 'hit', 'close', 'to', 'home', ',', 'be', 'lovely', 'to', 'look', 'at', 'and', 'so', 'much', 'fun', 'to', 'play', 'along', 'with', '.', 'ben', 'gazzara', 'be', 'an', 'excellent', 'cast', 'and', 'easy', 'to', 'fall', 'in', 'love', 'with', '.', 'i', 'be', 'sure', 'i', 'have', 'meet', 'arthur', 'in', 'my', 'travel', 'somewhere', '.', 'all', 'around', ',', 'an', 'excellent', 'choice', 'to', 'pick', 'up', 'any', 'evening.!:-', ')']
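The tag-stripping step depends on the non-greedy quantifier in `<.*?>`. The following is a minimal sketch of that step in isolation, using only the standard library; `strip_tags` and the sample strings are illustrative names, not part of the notebook.

```python
import re

# Same pattern as the cleaning function: '<', then as few characters
# as possible, then '>'.
tag_regex = re.compile('<.*?>')

def strip_tags(text):
    """Replace every HTML-like tag with a single space."""
    return tag_regex.sub(' ', text)

sample = 'enjoyable.<br /><br />How then'
cleaned = strip_tags(sample)
print(repr(cleaned))

# A greedy pattern would swallow everything between the first '<'
# and the last '>', including the text between the tags.
greedy = re.sub('<.*>', ' ', 'a <b>bold</b> word')
lazy = strip_tags('a <b>bold</b> word')
print(repr(greedy), repr(lazy))
```

Each tag becomes one space, so back-to-back `<br />` tags leave a double space. spaCy tokenizes that as a whitespace token, which is probably why the decoded reviews show doubled spaces.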


In [6]:
import numpy as np
from tqdm import tqdm

try:
    data = np.load('imdb.npz',allow_pickle=True)
    X_train_tokens = data['X_train_tokens']
    X_test_tokens = data['X_test_tokens']
    data.close()
except FileNotFoundError:
    X_train_tokens = []
    for sample in tqdm(X_train_txt, desc='Processing training files'):
        X_train_tokens.append(text_data_cleaning(sample))

    X_test_tokens = []
    for sample in tqdm(X_test_txt, desc='Processing test files'):
        X_test_tokens.append(text_data_cleaning(sample))

    X_train_tokens=np.array(X_train_tokens, dtype=object)
    X_test_tokens=np.array(X_test_tokens, dtype=object)

    np.savez('imdb.npz', 
            X_train_tokens=X_train_tokens,
            Y_train = Y_train,
            X_test_tokens = X_test_tokens,
            Y_test=Y_test)

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 2000
max_len = 512

tokenizer = Tokenizer(num_words=vocab_size, oov_token='<oov>')
tokenizer.fit_on_texts(X_train_tokens)

X_train = tokenizer.texts_to_sequences(X_train_tokens)
X_test = tokenizer.texts_to_sequences(X_test_tokens)

X_train = pad_sequences(X_train, maxlen=max_len, dtype='int32', padding='post',
                        truncating='post', value=0)
X_test = pad_sequences(X_test, maxlen=max_len, dtype='int32', padding='post',
                        truncating='post', value=0)
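The effect of `padding='post'` and `truncating='post'` can be sketched in plain Python. The helper `pad_post` below is a hypothetical stand-in for illustration, not the Keras implementation: long sequences lose their tail, and short ones are filled at the end with the reserved index 0.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                # truncating='post' drops the tail
        kept += [value] * (maxlen - len(kept))   # padding='post' appends the filler
        padded.append(kept)
    return padded

toy = [[5, 8, 2], [7, 3, 9, 4, 6, 1]]
result = pad_post(toy, maxlen=4)
print(result)
```

With `max_len = 512`, any review longer than 512 tokens keeps only its first 512.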

In [8]:
print(tokenizer.sequences_to_texts(X_train[0:1]))
print(tokenizer.sequences_to_texts(X_train[-3:-2]))

['red rock west ( <oov> )  <oov> cage get <oov> in a <oov> crime without at first know it , and the <oov> lead to <oov> <oov> , adventure and <oov> in the wild <oov> american west of the <oov> . red rock west be often brutal and sometimes hilarious , and cage pull off the <oov> with his usual <oov> wit and <oov> <oov> .  be the plot over the top ? yes . be <oov> <oov> perfect as a <oov> , almost likable killer ? yes . do cage stand a chance ? well , you have to watch and see . it never let up , and it take me by surprise the first time i see it . on second viewing <oov> , i be surprised at how well it hold up , how well <oov> it be , and how <oov> and funny it be at the same time .  director <oov> <oov> ( who also help write ) be know more for his tv work , but with <oov> and this film he show a <oov> hand with <oov> plot . it be save by its humor by the way , and by the <oov> . the bar be <oov> , the cop <oov> . and do not miss a really inspire cameo by <oov> <oov> as a <oov> <oov> . 
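The many `<oov>` entries in the decoded review follow from `num_words=vocab_size`: only the most frequent words keep their own index, and every other word maps to the out-of-vocabulary index. Below is a rough sketch of that behaviour, assuming frequency-ranked indices with `<oov>` at index 1. `build_index` and `encode` are hypothetical helpers, not Keras functions.

```python
from collections import Counter

def build_index(token_lists, oov_token='<oov>'):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(tokens, index, num_words, oov_token='<oov>'):
    """Map tokens to ids; unknown or too-rare words fall back to the OOV id."""
    oov_id = index[oov_token]
    ids = []
    for tok in tokens:
        i = index.get(tok, oov_id)
        ids.append(i if i < num_words else oov_id)
    return ids

docs = [['the', 'movie', 'be', 'good'],
        ['the', 'movie', 'be', 'bad'],
        ['the', 'plot']]
index = build_index(docs)
encoded = encode(['the', 'plot', 'be', 'good', 'unseen'], index, num_words=5)
print(encoded)
```

Under this scheme index 0 is never assigned to a word, which leaves it free to act as the padding value.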

In [9]:
from sklearn.model_selection import train_test_split
X_train,  X_val, Y_train, Y_val = train_test_split(X_train, Y_train,
                                                  test_size=.075, stratify=Y_train,
                                                  shuffle=True,random_state=123)

In [10]:
import os

embedding_dim = 100

embedding_file_path = f'glove.6B.{embedding_dim}d.txt'

if not os.path.exists(embedding_file_path):
    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip -q glove.6B.zip

embeddings_index = {}
with open(embedding_file_path, 'r', encoding="utf-8") as f:
    for line in tqdm(f):
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("\nFound %s word vectors." % len(embeddings_index))

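# Start from random values; rows whose word has a GloVe vector are overwritten below.
embedding_matrix = np.random.uniform(size=(vocab_size, embedding_dim)).astype('float32')

for i in range(vocab_size):
    try:
        embedding_matrix[i] = embeddings_index[tokenizer.index_word[i]]
    except KeyError:
        # Index 0 (padding), '<oov>', and words missing from GloVe keep their random row.
        pass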

400000it [00:07, 50963.42it/s]


Found 400000 word vectors.
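Filling the embedding matrix can be illustrated on a toy vocabulary. In this sketch every name prefixed with `toy_` is a hypothetical stand-in for the notebook's objects, and the vectors are made-up numbers rather than GloVe values. It also counts how many rows actually receive a pretrained vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-ins for embeddings_index and tokenizer.index_word.
toy_dim = 3
toy_vectors = {'movie': np.array([0.1, 0.2, 0.3], dtype='float32'),
               'good': np.array([0.4, 0.5, 0.6], dtype='float32')}
toy_index_word = {1: '<oov>', 2: 'movie', 3: 'good', 4: 'gazzara'}
toy_vocab_size = 4  # rows 0..3; index 4 lies outside the capped vocabulary

matrix = rng.uniform(size=(toy_vocab_size, toy_dim)).astype('float32')
initial = matrix.copy()

hits = 0
for i in range(toy_vocab_size):
    word = toy_index_word.get(i)      # None for the padding index 0
    vector = toy_vectors.get(word)    # None if the word has no pretrained vector
    if vector is not None:
        matrix[i] = vector
        hits += 1

print(f'{hits} of {toy_vocab_size} rows taken from pretrained vectors')
```

Counting hits this way is a cheap check that the tokenizer's vocabulary and the pretrained vocabulary overlap as expected.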





In [16]:
from tensorflow.keras import models, layers, initializers, optimizers, losses, metrics

model_layers = []
model_layers.append(layers.Embedding(vocab_size, 
                               embedding_dim, 
                               embeddings_initializer=initializers.Constant(embedding_matrix), 
                               mask_zero=True,
                               input_shape=[max_len]))
model_layers.append(layers.LSTM(128))
model_layers.append(layers.Dense(1, activation=None))

model = models.Sequential(model_layers, name='lstm_model')

loss = losses.BinaryCrossentropy(from_logits=True)
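# Note: BinaryAccuracy thresholds predictions at 0.5 by default. The model outputs
# logits, so threshold=0.0 would correspond to a predicted probability of 0.5.
acc = metrics.BinaryAccuracy(name='acc')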
optim = optimizers.Adam(learning_rate=1e-3)

model.compile(optimizer=optim, loss=loss, metrics=[acc])
model.summary()

Model: "lstm_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 512, 100)          200000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 317,377
Trainable params: 317,377
Non-trainable params: 0
_________________________________________________________________
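The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout: four gates, each with an input kernel, a recurrent kernel, and a bias vector.

```python
vocab_size, embedding_dim, lstm_units = 2000, 100, 128

# One embedding_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Single output unit: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

These values match the `Param #` column: 200000, 117248, 129, and 317,377 in total.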


In [17]:
model.fit(X_train, Y_train, batch_size=32, epochs=5, validation_data=(X_val,Y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efc000cf0d0>

In [18]:
scores = model.evaluate(X_test, Y_test)  # [test loss, test accuracy]
scores



[0.27464592456817627, 0.8914399743080139]
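Because the final `Dense` layer has no activation and the loss uses `from_logits=True`, the model outputs logits rather than probabilities. The NumPy sketch below uses made-up logits and labels (not values from this model) to show how to convert logits to probabilities. It also shows why the decision threshold matters: a probability of 0.5 corresponds to a logit of 0, not 0.5.

```python
import numpy as np

def sigmoid(z):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits and labels, purely for illustration.
logits = np.array([-2.0, 0.2, 0.4, 0.7, 3.0])
labels = np.array([0, 1, 1, 1, 1])

probs = sigmoid(logits)
pred_from_probs = (probs > 0.5).astype(int)        # same as logits > 0
pred_from_logits_05 = (logits > 0.5).astype(int)   # thresholding raw logits at 0.5

acc_probs = (pred_from_probs == labels).mean()
acc_logits_05 = (pred_from_logits_05 == labels).mean()
print(acc_probs, acc_logits_05)
```

The two thresholds disagree only for logits between 0 and 0.5, so any gap between them depends on how many predictions fall in that band.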

In [19]:
from tensorflow.keras import models, layers, initializers, optimizers, losses, metrics

model_layers = []
model_layers.append(layers.Embedding(vocab_size, 
                               embedding_dim, 
                               embeddings_initializer=initializers.Constant(embedding_matrix), 
                               mask_zero=True,
                               input_shape=[max_len]))
model_layers.append(layers.SimpleRNN(256))
model_layers.append(layers.Dense(1, activation=None))

model = models.Sequential(model_layers, name='rnn_model')

loss = losses.BinaryCrossentropy(from_logits=True)
acc = metrics.BinaryAccuracy(name='acc')
optim = optimizers.Adam(learning_rate=1e-3)

model.compile(optimizer=optim, loss=loss, metrics=[acc])
model.summary()

Model: "rnn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 512, 100)          200000    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 256)               91392     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
Total params: 291,649
Trainable params: 291,649
Non-trainable params: 0
_________________________________________________________________
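The `SimpleRNN` count comes from a single set of weights instead of four gates. A short sketch of the arithmetic, assuming the usual input kernel, recurrent kernel, and bias:

```python
embedding_dim, rnn_units = 100, 256

# One input kernel, one recurrent kernel, and one bias vector.
rnn_params = rnn_units * (embedding_dim + rnn_units) + rnn_units

# Single output unit on top of the 256-dimensional final state.
rnn_dense_params = rnn_units * 1 + 1

rnn_total = 2000 * embedding_dim + rnn_params + rnn_dense_params
print(rnn_params, rnn_dense_params, rnn_total)
```

Despite having twice as many units as the LSTM layer, the recurrent layer here has fewer parameters because it lacks the gating weights.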


In [20]:
model.fit(X_train, Y_train, batch_size=32, epochs=5, validation_data=(X_val,Y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7efc01a21750>

In [21]:
# Use distinct names so the compiled loss and metric objects are not overwritten.
test_loss, test_acc = model.evaluate(X_test, Y_test)
print(f'Test accuracy = {test_acc:0.3%}')

Test accuracy = 71.604%
