# <span style="color:#0b486b">  FIT5215: Deep Learning (2022)</span>
***
*CE/Lecturer:*  **Dr Trung Le** | trunglm@monash.edu <br/> <br/>
*Tutors:*  **Mr Tuan Nguyen** \[tuan.ng@monash.edu\] | **Mr Anh Bui** \[tuananh.bui@monash.edu\] | **Mr Xiaohao Yang** \[xiaohao.yang@monash.edu\] | **Mr Md Mohaimenuzzaman** \[md.mohaimen@monash.edu\] | **Mr Thanh Nguyen** \[Thanh.Nguyen4@monash.edu\]
<br/> <br/>
Faculty of Information Technology, Monash University, Australia
******

# <span style="color:#0b486b">Tutorial 09b: RNNs with Word2Vec</span> <span style="color:red">*****</span> #

This tutorial shows how to use a pretrained Word2Vec model to initialize the embedding matrix of an RNN used for a given task, for example sentence classification or sentiment analysis. Compared with random initialization, initializing the embedding matrix from a pretrained Word2Vec model lets us take advantage of the linguistic/semantic relationships that the model has learned from the large text corpus it was trained on (e.g., a Google News dataset of about 100 billion words, with a vocabulary of 3 million words and phrases).

More specifically, we build an RNN for *spam SMS detection* whose embedding matrix is initialized from a pretrained Word2Vec model.
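As a rough illustration of the idea, the sketch below fills an embedding matrix from a lookup of pretrained vectors. The three-dimensional vectors in `pretrained` and the tiny `word2idx` dictionary are made up for illustration; a real pretrained model would supply vectors with hundreds of dimensions. Words without a pretrained vector keep an all-zero row here, which is one simple choice among several.

```python
import numpy as np

# Hypothetical pretrained vectors (a real model would have e.g. 300 dimensions).
pretrained = {
    "free": np.array([0.9, 0.1, 0.0]),
    "call": np.array([0.7, 0.2, 0.1]),
    "hello": np.array([0.0, 0.8, 0.3]),
}

# Hypothetical vocabulary of our task; index 0 is reserved for padding.
word2idx = {"free": 1, "call": 2, "jurong": 3}

embed_size = 3
embed_matrix = np.zeros((len(word2idx) + 1, embed_size))

for word, idx in word2idx.items():
    vector = pretrained.get(word)
    if vector is not None:          # word known to the pretrained model
        embed_matrix[idx] = vector  # copy its vector into row idx
    # otherwise the row stays all zeros

print(embed_matrix)
```

Only the rows for "free" and "call" receive pretrained values; the padding row and the row for the unknown word "jurong" remain zero.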

We first import some necessary packages and libraries.

In [1]:
import tensorflow as tf
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np

## <span style="color:#0b486b">I. Introduction of the SMS spam detection dataset</span> ##

The dataset we investigate in this tutorial is the SMS Spam Collection, a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam. More information on this dataset can be found [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset). 

## <span style="color:#0b486b">II. Load and preprocess the dataset</span> ##

We create the class *DataManager* as a hub that helps us to load, preprocess, and manipulate the data, and to build the necessary vocabulary and dictionaries (`word2idx` and `idx2word`).
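To make the roles of `word2idx`, `idx2word`, and padding concrete, here is a simplified pure-Python sketch. It only imitates the general behaviour the class relies on (more frequent words get smaller indices, indices start at 1, and 0 is used for padding). The two toy sentences and the helper names `build_word2idx` and `pad_post` are assumptions for illustration, and the real Keras tokenizer also handles punctuation filtering and other details.

```python
from collections import Counter

def build_word2idx(texts):
    """Map each word to an index starting at 1, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def pad_post(seqs, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

texts = ["free entry call now", "call me later"]   # toy sentences
word2idx = build_word2idx(texts)
idx2word = {idx: word for word, idx in word2idx.items()}

# Convert each sentence to a list of indices, then pad to a common length.
seqs = [[word2idx[w] for w in t.lower().split()] for t in texts]
padded = pad_post(seqs)

print(word2idx)
print(padded)
```

Because index 0 never appears in `word2idx`, it can safely mark padded positions.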

In [2]:
class DataManager:
    def __init__(self, url= None):
        self.url = url
        self.max_seq_len = None       # store the max sequence length
        self.num_sentences = None     # store number of sentences 
        self.texts = None             # store all sentences
        self.labels = None            # store all labels
        self.nums_seqs = None         # store sequences of indices 
        self.vocab_size = None
        
    
    def read_data(self, file_path):
        df = pd.read_csv(file_path, encoding = "ISO-8859-1")
        labels, texts = df['v1'], df['v2']
        self.texts= texts
        self.labels = labels    
    
    def transform_to_numbers(self):
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer()
        self.tokenizer.fit_on_texts(self.texts)
        self.nums_seqs = self.tokenizer.texts_to_sequences(self.texts)
        self.nums_seqs = tf.keras.preprocessing.sequence.pad_sequences(self.nums_seqs, padding='post')
        le = LabelEncoder()
        le.fit(self.labels)
        self.nums_labels = le.transform(self.labels) 
        self.max_seq_len = len(self.nums_seqs[0])
        self.num_sentences = len(self.nums_seqs)
        
    def build_vocabulary(self):
        self.word2idx = self.tokenizer.word_index
        self.idx2word = {v:k for k,v in self.word2idx.items()}
        self.vocab_size = len(self.word2idx)
        self.min_index = min(self.word2idx.values())
        self.max_index = max(self.word2idx.values())
        
    def process_data(self):
        self.transform_to_numbers()
        self.build_vocabulary()
        
        
    def train_valid_test_split(self, train_ratio= 0.8, test_ratio=0.1):
        # The validation split receives whatever remains after the train and test splits.
        train_size = int(self.num_sentences*train_ratio) +1
        test_size = int(self.num_sentences*test_ratio) +1
        valid_size = self.num_sentences - (train_size + test_size)
        data_set = tf.data.Dataset.from_tensor_slices((self.nums_seqs, self.nums_labels))
        # Keep one fixed shuffled order; reshuffling on every pass would let the three splits overlap.
        data_set = data_set.shuffle(1000, reshuffle_each_iteration=False)
        self.train_set = data_set.take(train_size)
        self.valid_set = data_set.skip(train_size).take(valid_size)  # exclude the test examples
        self.test_set = data_set.skip(train_size + valid_size)
        
    def print_infor(self, num_samples = 5):
        print("Here are some statistics and examples from the dataset")
        if self.num_sentences is not None:
            print("+ Dataset has {} sentences".format(self.num_sentences))
        if self.vocab_size is not None:
            print("+ Vocabulary size is {} with min index= {}, max index= {}".format(self.vocab_size, self.min_index, self.max_index))
        if self.max_seq_len is not None:
            print("+ The max sequence length is {}".format(self.max_seq_len))
        if self.texts is not None:
            print("\nHere are some text samples")
            for i in range(num_samples):
                print("+ Text: {}\n+ Indices: {}\n+ Label: {} ({})\n".format(self.texts[i], self.nums_seqs[i],self.labels[i], self.nums_labels[i]))

In [3]:
dm = DataManager()

In [4]:
dm.read_data("./datasets/spam.csv")

In [5]:
dm.process_data()

In [6]:
dm.print_infor()

Here are some statistics and examples from the dataset
+ Dataset has 5572 sentences
+ Vocabulary size is 8920 with min index= 1, max index= 8920
+ The max sequence length is 189

Here are some text samples
+ Text: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
+ Indices: [  50  469 4410  841  751  657   64    8 1324   89  121  349 1325  147
 2987 1326   67   58 4411  144    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    

## <span style="color:#0b486b">III. Build the RNN model</span> ##

In [7]:
import gensim.downloader as api

The class *RNN_Spam_Detection* represents the RNN for SMS spam detection. There are some important attributes (properties or instance variables) of this class:
- `run_mode` (`scratch` or `init-fine-tune`) specifies whether we train the embedding matrix from scratch or initialize its weights from the pretrained word-embedding model and then fine-tune them.
- `embed_model` names the pretrained word-embedding model we use to initialize the embedding matrix. Note that in this case, the embedding size is given by the number at the end of the model name (e.g., 300 for `glove-wiki-gigaword-300`).
- `embed_size` specifies the embedding size, which is also the size of the input vectors fed to the first recurrent layer. Note that if the running mode is not *scratch*, the embedding size is set to the one specified by the embedding model.
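The relationship between `embed_model`, `embed_size`, and the shape of the embedding matrix can be sketched as follows. The helper `resolve_embed_size` is hypothetical and simply mirrors the convention described above (the trailing number of the model name is the embedding size). The vocabulary size of 8920 is the one printed for this dataset earlier, and the extra row accounts for the padding index 0.

```python
def resolve_embed_size(run_mode, embed_model, embed_size):
    """Return the embedding size implied by the run mode (hypothetical helper)."""
    if run_mode != "scratch":
        # e.g. "glove-wiki-gigaword-300" -> 300
        return int(embed_model.split("-")[-1])
    return embed_size

vocab_size = 8920 + 1   # +1 row for the padding index 0

for mode in ("scratch", "init-fine-tune"):
    size = resolve_embed_size(mode, "glove-wiki-gigaword-300", 128)
    print(mode, "-> embedding matrix shape:", (vocab_size, size))
```

Note that this convention only works for model names that end with the vector dimension.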

In [8]:
class RNN_Spam_Detection:
    def __init__(self, run_mode="scratch", embed_model="glove-wiki-gigaword-300", embed_size=128, data_manager=None):
        self.embed_path = "embeddings/E.npy"
        self.embed_model = embed_model
        self.embed_size = embed_size
        if run_mode != 'scratch':
            self.embed_size = int(self.embed_model.split("-")[-1])
        self.data_manager = data_manager
        self.vocab_size = self.data_manager.vocab_size +1  
        self.word2idx = self.data_manager.word2idx
        self.embed_matrix = np.zeros((self.vocab_size, self.embed_size))
        self.run_mode = run_mode
        self.model = None
    
    def build_embedding_matrix(self):
        if os.path.exists(self.embed_path): # a cached embedding matrix already exists
            self.embed_matrix = np.load(self.embed_path) # reuse the cached embedding matrix
        else: # no cached file yet (e.g., first run)
            self.word2vect = api.load(self.embed_model) # load the pretrained embedding model
            for word, idx in self.word2idx.items():
                try:
                    self.embed_matrix[idx] = self.word2vect[word] # copy the pretrained vector into row idx
                except KeyError: # word is not in the pretrained vocabulary, so its row stays all zeros
                    pass
            os.makedirs(os.path.dirname(self.embed_path), exist_ok=True) # make sure the target folder exists
            np.save(self.embed_path, self.embed_matrix)
    
    def build(self):
        inputs = tf.keras.layers.Input(shape=[None])
        if self.run_mode == "scratch":
            self.embedding_layer = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero=True, trainable=True)
        else: # fine-tuned
            self.build_embedding_matrix()
            self.embedding_layer = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero=True, trainable=True,
                                                        weights=[self.embed_matrix])
        h = self.embedding_layer(inputs)
        h = tf.keras.layers.GRU(256, return_sequences=True)(h)
        h = tf.keras.layers.GRU(128)(h)
        h = tf.keras.layers.Dense(1, activation="sigmoid")(h)
        self.model = tf.keras.Model(inputs= inputs, outputs=h)
    
    def compile_model(self, *args, **kwargs):
        self.model.compile(*args, **kwargs)
    
    def fit(self, *args, **kwargs):
        return self.model.fit(*args, **kwargs)  # return the History object
    
    def evaluate(self, *args, **kwargs):
        return self.model.evaluate(*args, **kwargs)  # return the loss and metric values
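
Conceptually, the embedding layer used in `build` turns each word index into the corresponding row of the embedding matrix, and `mask_zero=True` tells the downstream recurrent layers to skip positions whose index is 0 (the padding). The NumPy sketch below only illustrates this lookup-and-mask idea with made-up numbers; it is not how Keras implements the layer internally.

```python
import numpy as np

# Toy embedding matrix: 4 rows (index 0 = padding) and embedding size 2.
embed_matrix = np.array([
    [0.0, 0.0],   # row 0: padding
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

# A batch with two post-padded sequences of word indices.
batch = np.array([
    [1, 3, 2],
    [2, 0, 0],
])

embedded = embed_matrix[batch]   # row lookup, result has shape (2, 3, 2)
mask = batch != 0                # True where a real word is present

print(embedded.shape)
print(mask)
```

In the fine-tuning mode, the rows of this matrix start from the pretrained vectors and are then updated during training like any other weights.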

### <span style="color:#0b486b">III.1. Run in the running mode of training from scratch</span> ###

We now set random seeds for both NumPy and TensorFlow to make the runs more reproducible.

In [9]:
tf.random.set_seed(6789)
np.random.seed(6789)

In [10]:
rnn1 = RNN_Spam_Detection(data_manager=dm, run_mode="scratch")

In [11]:
rnn1.build()

In [12]:
rnn1.compile_model(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

In [13]:
dm.train_valid_test_split()
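
With the default ratios, the split sizes follow directly from the arithmetic in `train_valid_test_split`. The sketch below reproduces that arithmetic for the 5572 sentences reported earlier and partitions a list of indices into three non-overlapping parts. The function name `split_sizes` is an assumption for illustration, and plain Python lists stand in for the TensorFlow dataset.

```python
def split_sizes(num_sentences, train_ratio=0.8, test_ratio=0.1):
    """Compute (train, valid, test) sizes; validation takes the remainder."""
    train_size = int(num_sentences * train_ratio) + 1
    test_size = int(num_sentences * test_ratio) + 1
    valid_size = num_sentences - (train_size + test_size)
    return train_size, valid_size, test_size

train_size, valid_size, test_size = split_sizes(5572)

indices = list(range(5572))                            # stand-in for the dataset
train = indices[:train_size]                           # first train_size items
valid = indices[train_size:train_size + valid_size]    # next valid_size items
test = indices[train_size + valid_size:]               # everything that is left

print(train_size, valid_size, test_size)
```

Keeping the three parts disjoint matters: if validation or test examples also appear in the training part, the reported accuracy is overly optimistic.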

In [14]:
rnn1.model.fit(dm.train_set.batch(64), epochs=5, validation_data= dm.valid_set.batch(64))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2a7cc5902e0>

In [15]:
rnn1.evaluate(dm.test_set.batch(64))



### <span style="color:#0b486b">III.2. Run in the running mode of fine-tuning the embedding matrix</span> ###

In [16]:
rnn2 = RNN_Spam_Detection(data_manager=dm, run_mode="init-fine-tune")

In [17]:
rnn2.build()

In [18]:
rnn2.compile_model(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

In [19]:
rnn2.model.fit(dm.train_set.batch(64), epochs=5, validation_data= dm.valid_set.batch(64))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2a7d9b21130>

In [20]:
rnn2.evaluate(dm.test_set.batch(64))



---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>