#### In this notebook, I use an LSTM (Long Short-Term Memory) network for text classification on the Yelp Review dataset.

#### The main purpose is to combine different kinds of **word embeddings** with a neural network (here, an LSTM) and see how they affect overall model accuracy.
---
##### We will be using these four embedding methods:
1. Default Keras Embedding
2. word2vec
3. fastText 
4. GloVe
---
Note: Because this dataset is balanced, I use accuracy as the evaluation metric. All models are built with Keras.
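
To illustrate why class balance matters for accuracy, here is a minimal sketch. The label lists are made up (they are not taken from the Yelp data), and `majority_baseline_accuracy` is a hypothetical helper name. It shows the accuracy a model would get by always predicting the most frequent class.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical label lists, not taken from the Yelp data.
balanced = [1, 2] * 50            # 50 negative, 50 positive
imbalanced = [1] * 95 + [2] * 5   # 95 negative, 5 positive

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(imbalanced))  # 0.95
```

On balanced data the trivial baseline sits at 50%, so any accuracy well above that reflects real learning. On the imbalanced list the same trivial strategy already reaches 95%, which is why accuracy alone would be misleading there.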

In [None]:
import pandas as pd
import numpy as np
from tensorflow import keras
from tqdm import tqdm
import nltk

In [None]:
print(keras.__version__)



Here we do binary classification using the Yelp Review Sentiment dataset.
- Dataset Link: [Kaggle Link](https://www.kaggle.com/ilhamfp31/yelp-review-dataset)

In [None]:
train = pd.read_csv('../input/yelp-review-dataset/yelp_review_polarity_csv/train.csv', names = ['sentiment', 'text'] )
test  = pd.read_csv('../input/yelp-review-dataset/yelp_review_polarity_csv/test.csv',  names = ['sentiment', 'text'] )

#### Here, negative polarity is class 1 and positive polarity is class 2.
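
As a small sketch of how these two class labels become training targets, the snippet below one-hot encodes a made-up label array with NumPy. The notebook itself uses `pd.get_dummies` on the `sentiment` column later on, which should give the same column order (column 0 for class 1, column 1 for class 2); the NumPy version here is only an illustration.

```python
import numpy as np

# Hypothetical labels in the dataset's convention: 1 = negative, 2 = positive.
labels = np.array([1, 2, 2, 1])

# Shift to 0-based indices, then pick rows of a 2x2 identity matrix.
one_hot = np.eye(2, dtype="float32")[labels - 1]
print(one_hot)
```

Each row has a single 1: in the first column for a negative review and in the second column for a positive one, which matches the two-unit softmax output used by the models.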


In [None]:
train = train[:15000]
train.head()

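#### Before diving into that, let's do a little text cleaning/preprocessing. (The cell below is left commented out, so the models in this notebook are trained on the raw text.)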
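
For reference, here is a lighter, standard-library-only sketch of the same kind of cleaning. `simple_clean` is a hypothetical helper, the input string is made up, and the stop-word removal and lemmatization steps are deliberately left out.

```python
import re

def simple_clean(text):
    """Lower-case and strip URLs, @mentions and non-word characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # @mentions
    text = re.sub(r"[\W_]+", " ", text)         # punctuation and underscores
    return text.strip()

print(simple_clean("Loved it!! See https://example.com @yelp_fan :)"))
# loved it see
```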

In [None]:
# pip install bs4
# from bs4 import BeautifulSoup
# from nltk.corpus import stopwords
# from nltk.tokenize import sent_tokenize, word_tokenize
# from nltk.stem import WordNetLemmatizer 
# import re
# lm=WordNetLemmatizer()


# def ReturnCleanText(text):
#         # change the text into lower case.(Note: in case of social media text, it is good to leave them as it is)
#         text = text.lower()
#         # removing xml tags from tweets
#         text =BeautifulSoup(text, 'lxml').get_text()
#         # removing URLS 
#         text =re.sub('https?://[A-Za-z0-9./]+','',text)
#         # removing words with "@"
#         text =re.sub(r'@[A-Za-z0-9]+','',text)
#         # removing special characters
#         text = re.sub(r"\W+|_", ' ', text)
#         # tokenization of sentences
#         text = word_tokenize(text)
#         # lemmatize the text using WordNet and drop stopwords
#         words = [lm.lemmatize(word) for word in text if word not in set(stopwords.words('english'))]   
#         return " ".join(words)
    
# train['clean_text'] = train['text'].apply(ReturnCleanText)

# LSTM (Long Short-Term Memory):
> Long short-term memory is an artificial recurrent neural network architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can not only process single data points, but also entire sequences of data.
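
To make the "feedback connections" concrete, here is a minimal NumPy sketch of a single LSTM step applied over a short sequence. The sizes, random weights and the gate ordering inside `z` are assumptions chosen for illustration; this is not how Keras lays out its weights internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b      # pre-activations of all four gates
    i = sigmoid(z[0:H])               # input gate
    f = sigmoid(z[H:2 * H])           # forget gate
    o = sigmoid(z[2 * H:3 * H])       # output gate
    g = np.tanh(z[3 * H:4 * H])       # candidate cell state
    c_t = f * c_prev + i * g          # updated cell state
    h_t = o * np.tanh(c_t)            # updated hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D, H, T = 3, 4, 5                     # toy input size, hidden size, sequence length
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):   # process the sequence one step at a time
    h, c = lstm_step(x_t, h, c, W, U, b)  # h and c feed back into the next step
print(h.shape, c.shape)
```

The hidden state `h` and cell state `c` produced at one step are inputs to the next step, which is the feedback that lets the network summarize an entire sequence rather than a single data point.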

In [None]:
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization


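# Import everything from tensorflow.keras so that layers are not mixed with the standalone keras package
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Bidirectional, Dropout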

In [None]:
# Just for example
max_features = 2000
Encoder = keras.layers.experimental.preprocessing.TextVectorization( max_tokens = max_features)
Encoder.adapt(train['text'].values)

vocab = np.array(Encoder.get_vocabulary())
print(vocab[:20])

example ="This is an example to test the encoder that we just created!"
print(Encoder(example).numpy())
print(" ".join(vocab[Encoder(example).numpy()]))

In [None]:
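max_features = 2000  # keep only the 2,000 most frequent words
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train['text'].values)
# Convert each review to a sequence of word indices, then pad/truncate to 300 tokens
X = tokenizer.texts_to_sequences(train['text'].values)
X = pad_sequences(X, padding='post', maxlen=300)
# One-hot encode the labels: column 0 = class 1 (negative), column 1 = class 2 (positive)
Y = pd.get_dummies(train['sentiment']).values

# Full vocabulary size (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1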

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.25, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

# Training with Keras default Embedding Layer

### Keras Embedding Layer: 

Embedding layers in Keras are trained just like any other layer in your network architecture: they are tuned to minimize the loss function by using the selected optimization method. The major difference from other layers is that their output is not a mathematical function of the input. Instead, the input to the layer is used to index a table of embedding vectors [1]. However, the underlying automatic differentiation engine has no problem optimizing these vectors to minimize the loss function.

So, you cannot say that the Embedding layer in Keras is doing the same as word2vec [2]. Remember that word2vec refers to a very specific network setup which tries to learn an embedding which captures the semantics of words. With Keras's embedding layer, you are just trying to minimize the loss function, so if for instance you are working with a sentiment classification problem, the learned embedding will probably not capture complete word semantics but just their emotional polarity.

More Here: 
1. https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer
2. https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work
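
The "index a table" idea described above can be sketched in a few lines of NumPy. The table below is random and its sizes are toy values (both assumptions); it only stands in for the weights that the Keras layer would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4                       # toy sizes
table = rng.normal(size=(vocab_size, embed_dim))   # stands in for the layer's weights

token_ids = np.array([[2, 5, 0], [1, 1, 3]])       # batch of 2 padded sequences
embedded = table[token_ids]                        # lookup: shape (2, 3, 4)

# The same result written as a one-hot matrix product, which is one way to see
# how gradients can reach the rows that were looked up.
one_hot = np.eye(vocab_size)[token_ids]            # shape (2, 3, 6)
assert np.allclose(embedded, one_hot @ table)
print(embedded.shape)                              # (2, 3, 4)
```

Every token id simply selects one row of the table, so a batch of shape `(batch, sequence_length)` becomes a tensor of shape `(batch, sequence_length, embed_dim)`.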

In [None]:
embid_dim = 300
lstm_out = 128


model = keras.Sequential()
model.add(Embedding(max_features, embid_dim, input_length = X.shape[1]))
model.add(Bidirectional(LSTM(lstm_out, dropout=0.2)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))
model.summary()

In [None]:
batch_size = 128
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
history = model.fit(X_train, Y_train, epochs = 5, batch_size=batch_size, verbose = 1, validation_data =(X_test, Y_test))

# Training with GloVe 300D Embeddings

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Link: https://nlp.stanford.edu/projects/glove/
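
The pretrained GloVe files are plain text, with one token per line followed by its vector values separated by spaces. Below is a small sketch of parsing that layout, using made-up 3-dimensional vectors held in memory instead of the real file. `parse_glove_line` is a hypothetical helper. It splits from the right because, as far as I know, some tokens in the larger GloVe files contain spaces themselves.

```python
import io
import numpy as np

def parse_glove_line(line, dim):
    """Split one GloVe text line into (token, vector of length dim)."""
    parts = line.rstrip("\n").rsplit(" ", dim)   # split from the right
    return parts[0], np.asarray(parts[1:], dtype="float32")

# Made-up 3-dimensional vectors in the GloVe text layout.
fake_file = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
    ". . . 0.5 0.5 0.5\n"        # a token that itself contains spaces
)
vectors = dict(parse_glove_line(line, dim=3) for line in fake_file)
print(sorted(vectors))
```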

In [None]:
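from tqdm import tqdm
# Load the GloVe vectors into a dict: word -> 300-d vector
embedding_vector = {}
with open('../input/glove840b300dtxt/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in tqdm(f):
        # Split from the right so that tokens containing spaces stay intact
        value = line.rstrip().rsplit(' ', 300)
        word = value[0]
        coef = np.asarray(value[1:], dtype='float32')
        embedding_vector[word] = coef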

In [None]:
embedding_matrix = np.zeros((vocab_size,300))
for word,i in tqdm(tokenizer.word_index.items()):
    embedding_value = embedding_vector.get(word)
    if embedding_value is not None:
        embedding_matrix[i] = embedding_value

In [None]:
embedding_matrix.shape

In [None]:
embid_dim = 300
lstm_out = 128


model = keras.Sequential()
model.add(Embedding(vocab_size, embid_dim, input_length =X.shape[1], weights = [embedding_matrix] , trainable = False))
model.add(Bidirectional(LSTM(lstm_out, dropout=0.2)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))
model.summary()

In [None]:
batch_size = 128
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
history = model.fit(X_train, Y_train, epochs = 5, batch_size=batch_size, verbose = 1, validation_data =(X_test, Y_test))

# Training with Word2Vec Pre-trained and Trained Embeddings
Reference: https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

**Word2Vec** is not a singular algorithm, rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets.

1. Continuous Bag-of-Words Model which predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
2. Continuous Skip-gram Model which predicts words within a certain range before and after the current word in the same sentence.
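
The two architectures in the list above differ mainly in how training examples are built from a sentence. Here is a small pure-Python sketch using a made-up sentence and an assumed window size of 1. `context_windows` is a hypothetical helper, and real implementations add details such as subsampling and negative sampling.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the food was really good".split()   # made-up sentence
window = 1                                    # assumed context size

# CBOW: the context words are the input, the middle word is the target.
cbow_examples = [(ctx, center) for center, ctx in context_windows(tokens, window)]

# Skip-gram: the current word is the input, each context word is a target.
skipgram_pairs = [
    (center, c) for center, ctx in context_windows(tokens, window) for c in ctx
]

print(cbow_examples[1])     # (['the', 'was'], 'food')
print(skipgram_pairs[:3])   # [('the', 'food'), ('food', 'the'), ('food', 'was')]
```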

Learn more here:
1. https://jalammar.github.io/illustrated-word2vec/
2. https://www.tensorflow.org/tutorials/text/word2vec

### Training a Word2Vec Embedding from scratch using the Gensim library:

In [None]:
sentences =[]
for t in  tqdm(range(len(train['text']))):
    text = nltk.word_tokenize(train['text'][t])
    sentences.append(text)

##### `sg`: either 0 or 1. The default is 0 (CBOW); pass 1 explicitly to use skip-gram.

In [None]:
from gensim.models import Word2Vec
# Note: `size` is the gensim 3.x argument name; gensim >= 4.0 renamed it to `vector_size`
w2v_model = Word2Vec(sentences, size=300, min_count=2, sg=0)

In [None]:
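# Note: `wv.vocab` is the gensim 3.x attribute; gensim >= 4.0 uses `wv.key_to_index`
words = list(w2v_model.wv.vocab)
print('Vocabulary size: %d' % len(words))

# save the trained vectors in word2vec text format
filename = 'embedding_word2vec.txt'
w2v_model.wv.save_word2vec_format(filename, binary=False)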

In [None]:
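# Load the saved vectors into a dict: word -> 300-d vector
embedding_vector = {}
with open('./embedding_word2vec.txt', encoding='utf-8') as f:
    next(f)  # skip the header line ("<vocab size> <dimension>")
    for line in tqdm(f):
        value = line.rstrip().split(' ')
        word = value[0]
        coef = np.asarray(value[1:], dtype='float32')
        embedding_vector[word] = coef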

In [None]:
embedding_matrix = np.zeros((vocab_size,300))
for word,i in tqdm(tokenizer.word_index.items()):
    embedding_value = embedding_vector.get(word)
    if embedding_value is not None:
        embedding_matrix[i] = embedding_value    

In [None]:
embid_dim = 300
lstm_out = 128


model = keras.Sequential()
model.add(Embedding(vocab_size, embid_dim, input_length =X.shape[1], weights = [ embedding_matrix] , trainable = False))
model.add(Bidirectional(LSTM(lstm_out, dropout=0.2)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))
model.summary()

In [None]:
batch_size = 128
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
history = model.fit(X_train, Y_train, epochs = 50, batch_size=batch_size, verbose = 1, validation_data =(X_test, Y_test))

### Using Pretrained Word2Vec Embedding
Reference: https://www.kaggle.com/jaskarancr/word2vec-traditional-models

In [None]:
from gensim.models import KeyedVectors
filename = '../input/nlpword2vecembeddingspretrained/GoogleNews-vectors-negative300.bin'
w2v_pretrained_model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [None]:
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tqdm(tokenizer.word_index.items()):
    try:
        # Use the pretrained vector when the word is in the Google News vocabulary
        embedding_matrix[i] = w2v_pretrained_model[word]
    except KeyError:
        # Otherwise fall back to a random vector (mean 0, standard deviation 0.5)
        embedding_matrix[i] = np.random.normal(0, np.sqrt(0.25), 300)

In [None]:
embid_dim = 300
lstm_out = 128


model = keras.Sequential()
model.add(Embedding(vocab_size, 300, input_length =300, weights = [embedding_matrix ] , trainable = False))
model.add(Bidirectional(LSTM(lstm_out, dropout=0.2)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))
model.summary()

In [None]:
batch_size = 128
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
history = model.fit(X_train, Y_train, epochs = 5, batch_size=batch_size, verbose = 1, validation_data =(X_test, Y_test))

# Using Pretrained word2vec Embedding with Trainable as True

In [None]:
embid_dim = 300
lstm_out = 128


model = keras.Sequential()
model.add(Embedding(vocab_size, 300, input_length =300, weights = [embedding_matrix ],
                    trainable = True))
model.add(Bidirectional(LSTM(lstm_out, dropout=0.2)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))
model.summary()

In [None]:
batch_size = 128
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
history = model.fit(X_train, Y_train, epochs = 5, batch_size=batch_size, verbose = 1, validation_data =(X_test, Y_test))

In this notebook, I tried several word embedding techniques, and the validation accuracy for each one appears in its training log. Which embedding to choose depends on the problem at hand.

Accuracy can likely be improved further by tuning the embedding dimension, the text preprocessing, the model architecture, the batch size, and so on.