***Following my previous notebook on BoW and TF-IDF models, this notebook covers word embeddings and several ways to create them, then feeds those embeddings into a Sequential model with a bidirectional LSTM layer to classify the text, achieving a submission score above 0.8.***

# Text processing

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [None]:
train=pd.read_csv('../input/nlp-getting-started/train.csv')
test=pd.read_csv('../input/nlp-getting-started/test.csv')

You can check out my previous notebook [here](https://www.kaggle.com/urstrulysai/bow-tf-idf-models-with-basic-lr-0-80-score) for detailed text preprocessing and basic NLP models before moving on to word embeddings and LSTMs, GRUs and RNNs.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
!pip install contractions
import contractions
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
!pip install pyspellchecker
from spellchecker import SpellChecker

In [None]:
stop_words=nltk.corpus.stopwords.words('english')
i=0
#sc=SpellChecker()
#data=pd.concat([train,test])
wnl=WordNetLemmatizer()
stemmer=SnowballStemmer('english')
for doc in train.text:
    doc=re.sub(r'https?://\S+|www\.\S+','',doc)
    doc=re.sub(r'<.*?>','',doc)
    doc=re.sub(r'[^a-zA-Z\s]','',doc,flags=re.I|re.A) # flags must be passed by keyword; the 4th positional argument is count
    #doc=' '.join([stemmer.stem(i) for i in doc.lower().split()])
    doc=' '.join([wnl.lemmatize(i) for i in doc.lower().split()])
    #doc=' '.join([sc.correction(i) for i in doc.split()])
    doc=contractions.fix(doc)
    tokens=nltk.word_tokenize(doc)
    filtered=[token for token in tokens if token not in stop_words]
    doc=' '.join(filtered)
    train.loc[i,'text']=doc # .loc avoids chained assignment on the DataFrame
    i+=1
i=0
for doc in test.text:
    doc=re.sub(r'https?://\S+|www\.\S+','',doc)
    doc=re.sub(r'<.*?>','',doc)
    doc=re.sub(r'[^a-zA-Z\s]','',doc,flags=re.I|re.A) # flags must be passed by keyword; the 4th positional argument is count
    #doc=' '.join([stemmer.stem(i) for i in doc.lower().split()])
    doc=' '.join([wnl.lemmatize(i) for i in doc.lower().split()])
    #doc=' '.join([sc.correction(i) for i in doc.split()])
    doc=contractions.fix(doc)
    tokens=nltk.word_tokenize(doc)
    filtered=[token for token in tokens if token not in stop_words]
    doc=' '.join(filtered)
    test.loc[i,'text']=doc # .loc avoids chained assignment on the DataFrame
    i+=1

# Word embedding models

spaCy's English pipelines come in several sizes, and they differ in the word vectors they ship with. Each word in a pipeline's vector table is represented by a list of floating point numbers (a vector) that places the word in a high-dimensional space. The exact details depend on the spaCy version, but broadly:
1. en_core_web_sm: Here en is for English and sm is for small. It ships without static word vectors, so .vector falls back to 96-dimensional context-sensitive tensors.
2. en_core_web_md: md is for medium. It includes 300-dimensional word vectors with a reduced vector table.
3. en_core_web_lg: lg is for large. It includes 300-dimensional word vectors with a much larger vector table.

In [None]:
import spacy
nlp=spacy.load('en_core_web_sm')
# nlp(doc).vector is one document-level vector (an average over the document's tokens)
vecs=np.array([nlp(doc).vector for doc in train['text']])
train_df=pd.DataFrame(vecs)
test_df=pd.DataFrame(np.array([nlp(doc).vector for doc in test['text']]))

We can fit any classification model on these embeddings to get predictions. Besides spaCy, GloVe embeddings are available on Kaggle as datasets, which can be loaded directly and used for predictions with any classification model. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. To get a sentence-level embedding, we can simply average the embeddings of the words in that sentence and pass the result to a classifier.
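
As a rough illustration of the averaging idea, here is a minimal sketch. The `toy_vectors` dictionary and the `sentence_embedding` helper are made up for this example; real GloVe vectors would have 50 to 300 dimensions, and out-of-vocabulary words are simply skipped here, which is one choice among several.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real GloVe embeddings (made-up values).
toy_vectors = {
    "forest": np.array([1.0, 0.0, 2.0]),
    "fire": np.array([3.0, 2.0, 0.0]),
}

def sentence_embedding(sentence, vectors, dim):
    """Average the vectors of the words found in `vectors`; return zeros if none are found."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# "near" and "town" are not in the toy dictionary, so only two vectors are averaged.
emb = sentence_embedding("forest fire near town", toy_vectors, dim=3)
print(emb)
```

Stacking one such vector per tweet gives a feature matrix that any classifier can consume.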

1. There are other types of embedding models too, such as Word2Vec and FastText. Word2Vec itself comes in two variants: the skip-gram model, which predicts context words from a target word, and the CBoW (continuous bag of words) model, which predicts the target word from its context words.
2. Word2Vec ignores the morphological structure of a word and treats each word as a single entity. FastText treats each word as a bag of character n-grams and builds the word's embedding from the embeddings of those n-grams. As a result, rare words, and even words unseen during training, get a reasonable representation with FastText.
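
To make the two ideas above concrete, here is a small sketch in plain Python. Both helper names (`skipgram_pairs`, `char_ngrams`) are invented for this illustration and are not gensim functions. The first lists the (target, context) pairs a skip-gram model would train on, and the second lists the character n-grams of a word using `<` and `>` as boundary markers, in the style described for FastText. It uses a single n for simplicity, whereas FastText implementations typically use a range of n-gram lengths.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def char_ngrams(word, n):
    """Return the character n-grams of `word` after adding boundary markers."""
    marked = f"<{word}>"
    return [marked[k:k + n] for k in range(len(marked) - n + 1)]

print(skipgram_pairs(["forest", "fire", "spreading"], window=1))
print(char_ngrams("where", 3))
```

CBoW uses the same windows but flips the direction: the context words are the input and the target word is the prediction.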

In [None]:
# Word2Vec with gensim (parameter names follow gensim 4.x)
from gensim.models import word2vec

word2vec_model = word2vec.Word2Vec([nltk.word_tokenize(doc) for doc in train.text], # tokenized corpus
                                 vector_size = 15, # embedding dimension
                                 window = 20, # context window size
                                 min_count = 1, # ignore words that occur fewer times than this
                                 sg = 1, # 1 for skip-gram, CBoW otherwise
                                 sample = 1e-3, # downsampling threshold for frequent words
                                )
#word2vec_model.wv['word'] # returns the embedding for 'word'
#word2vec_model.wv.key_to_index # dictionary mapping each word to its index in the vocabulary

In [None]:
# FastText model

from gensim.models import FastText

FastTextModel = FastText([nltk.word_tokenize(doc) for doc in train.text], # tokenized corpus
                                 vector_size = 15, # embedding dimension
                                 window = 20, # context window size
                                 min_count = 1, # ignore words that occur fewer times than this
                                 sg = 1, # 1 for skip-gram, CBoW otherwise
                                 sample = 1e-3, # downsampling threshold for frequent words
                                 )
#FastTextModel.wv.similarity(w1 = 'word1', w2 = 'word2') # returns how similar two words are
#FastTextModel.wv.doesnt_match(array_of_words) # returns the odd one out in the array of words

Below I load the dataset uploaded by @anindya2906. It contains 50, 100, 200 and 300-dimensional pre-trained GloVe word embeddings.
Alternatively, the embeddings can be downloaded from the official website [here](https://nlp.stanford.edu/projects/glove/).

In [None]:
# load each GloVe word embedding into a dictionary mapping word -> vector
dict1={}
with open('../input/glove6b/glove.6B.200d.txt','r',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word=values[0]
        vectors=np.asarray(values[1:],dtype='float32')
        dict1[word]=vectors
# no f.close() needed: the with block closes the file

In [None]:
tok=tf.keras.preprocessing.text.Tokenizer()

# This class vectorizes a text corpus by turning each text into either a sequence of
# integers (each integer being the index of a token in a dictionary) or into a vector
# where the coefficient for each token can be binary, based on word count, based on tf-idf...

tok.fit_on_texts([nltk.word_tokenize(doc) for doc in train.text])
seq_train=tok.texts_to_sequences([nltk.word_tokenize(doc) for doc in train.text])
seq_test=tok.texts_to_sequences([nltk.word_tokenize(doc) for doc in test.text])
pad_train=tf.keras.preprocessing.sequence.pad_sequences(seq_train,maxlen=25,padding='post',truncating='post')
pad_test=tf.keras.preprocessing.sequence.pad_sequences(seq_test,maxlen=25,padding='post',truncating='post')

1. To learn more, see the documentation for [tensorflow's tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).
2. pad_sequences brings every sentence to the same length. Shorter sentences are padded with 0s, either before (pre) or after (post) the sentence, until the max length is reached, and sentences longer than the max length are truncated.
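
The post-padding and post-truncation behaviour used above can be imitated in a few lines of plain Python. This is only a sketch to show what happens to each sequence; `pad_post` is a made-up helper, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence to `maxlen` from the end, then right-pad it with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                      # truncating='post' keeps the first maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post' appends zeros
    return padded

# The first sequence is too short and gets padded; the second is too long and gets cut.
print(pad_post([[5, 3], [7, 1, 4, 9, 2]], maxlen=4))
# [[5, 3, 0, 0], [7, 1, 4, 9]]
```

Because index 0 is reserved for padding, the Embedding layer can later mask those positions with mask_zero=True.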

In [None]:
# the embedding matrix has shape (number of unique words + 1, embedding dimension); row 0 stays all zeros for padding
# tok.word_index is a dictionary mapping each word to its integer index (starting at 1), so its length is the number of unique words

emb_matrix=np.zeros((len(tok.word_index)+1,200))
for word,i in tok.word_index.items():
    if dict1.get(word) is not None:
        emb_matrix[i]=dict1.get(word)

# This embedding matrix will be used as weights in Embedding layer later.
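
Words that have no pre-trained vector keep an all-zero row in the embedding matrix, so it can be worth checking how much of the vocabulary is actually covered. Below is a minimal sketch with toy data; `embedding_coverage`, `toy_word_index` and `toy_vectors` are hypothetical names used only for this example, standing in for `tok.word_index` and `dict1`.

```python
def embedding_coverage(word_index, vectors):
    """Return the fraction of vocabulary words that have a vector, plus the missing words."""
    missing = [w for w in word_index if w not in vectors]
    covered = len(word_index) - len(missing)
    return covered / max(len(word_index), 1), missing

# Toy stand-ins for the tokenizer vocabulary and the pre-trained embedding dictionary.
toy_word_index = {"fire": 1, "flood": 2, "lol": 3, "evacuate": 4}
toy_vectors = {"fire": [0.1], "flood": [0.2], "evacuate": [0.3]}

coverage, missing = embedding_coverage(toy_word_index, toy_vectors)
print(coverage, missing)
```

A low coverage usually points at preprocessing choices (for example lemmatized or misspelled tokens) that produce words the pre-trained vocabulary does not contain.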

# Model Building

Do go through the TensorFlow tutorials below if you want a full picture of how RNNs, GRUs, LSTMs and Bidirectional layers work in the background. You will also see how to stack LSTM layers back to back by setting return_sequences=True, how to use return_state, and much more.
1. [Text classification with an RNN](https://www.tensorflow.org/tutorials/text/text_classification_rnn)
2. [Recurrent Neural Networks (RNN) with Keras](https://www.tensorflow.org/guide/keras/rnn)
3. [Masking and padding with Keras](https://www.tensorflow.org/guide/keras/masking_and_padding)
4. You can find a research paper [here](https://arxiv.org/abs/1512.05287) if you want to know the difference between dropout and recurrent dropout in RNNs.

In [None]:
model=tf.keras.Sequential([
    tf.keras.layers.Embedding(len(tok.word_index)+1,200,weights=[emb_matrix],input_length=25,mask_zero=True,trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, dropout=0.2,recurrent_dropout=0.2,return_sequences=True)),
    tf.keras.layers.GlobalMaxPooling1D(), # use if return sequences is set to true
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64,activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32,activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1,activation='sigmoid')
])

# we can try out different hidden layers (GRUs, simple RNNs, etc.) and keep whichever gives the best result.
model.summary() # prints the summary itself; wrapping it in print() would also print None
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), # as this is a binary classification problem
              optimizer=tf.keras.optimizers.Adam(3e-4),
              metrics=['accuracy'])

In [None]:
checkpoint=tf.keras.callbacks.ModelCheckpoint('model.h5',monitor='val_loss',save_best_only=True)
reduce_lr=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',factor=0.2,patience=2,min_lr=1e-5)
es=tf.keras.callbacks.EarlyStopping(monitor='val_loss',patience=3,restore_best_weights=True)

history=model.fit(pad_train,train.target,batch_size=8,epochs=20,validation_split=0.2,callbacks=[checkpoint,es,reduce_lr])

Now we can try different methods to increase the F1 score. Training with StratifiedKFold cross-validation keeps the class ratio the same in every fold and usually gives a more reliable validation score, and often better results. Data augmentation and similar techniques are also worth trying. We can also vary hyperparameters such as the batch size and learning rate and see how the model performs.
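
To show what "stratified" means, here is a simplified pure-Python stand-in for stratified fold assignment. It is not scikit-learn's implementation (in practice `sklearn.model_selection.StratifiedKFold` would be the usual choice, with shuffling support); the `stratified_folds` helper and the toy labels are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample a fold id so that every class is spread evenly across folds."""
    fold_of = [0] * len(labels)
    seen = defaultdict(int)  # how many samples of each class have been assigned so far
    for idx, label in enumerate(labels):
        fold_of[idx] = seen[label] % n_splits  # round-robin within each class
        seen[label] += 1
    return fold_of

labels = [0, 0, 0, 0, 1, 1, 0, 1, 0, 1]  # toy targets: six 0s and four 1s
folds = stratified_folds(labels, n_splits=2)
for k in range(2):
    val_labels = [y for y, f in zip(labels, folds) if f == k]
    print(k, Counter(val_labels))
```

Each fold then serves once as the validation set while the model is trained on the remaining folds, and the test predictions of the fold models can be averaged.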

In [None]:
model.load_weights('./model.h5')
pred=model.predict(pad_test) # predicted probabilities, thresholded below to get class labels (predict_classes was removed in newer TensorFlow releases)

In [None]:
pd.DataFrame(np.where(pred>0.5,1,0)).value_counts()

In [None]:
pd.DataFrame({
    'id':test.id,
    'target':np.where(pred>0.5,1,0)[:,0] # you can also vary the threshold 0.5 and see if you get better scores
}).to_csv('submission.csv',index=False)
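
One way to pick the threshold mentioned in the comment above is to sweep a few candidate values on held-out validation predictions (not on the test set) and keep the one with the best F1. The sketch below uses made-up labels and probabilities, and `f1_score_binary` and `best_threshold` are helper names invented for this example.

```python
def f1_score_binary(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, thresholds):
    """Return (best_f1, threshold); the first threshold wins when scores tie."""
    best_f1, best_t = -1.0, None
    for t in thresholds:
        score = f1_score_binary(y_true, [int(p > t) for p in probs])
        if score > best_f1:
            best_f1, best_t = score, t
    return best_f1, best_t

# Made-up validation labels and predicted probabilities.
y_val = [1, 1, 0, 0, 1]
p_val = [0.9, 0.4, 0.35, 0.1, 0.8]
best_f1, best_t = best_threshold(y_val, p_val, [0.3, 0.5, 0.7])
print(best_f1, best_t)
```

The chosen threshold can then replace 0.5 in the np.where call when building the submission.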

**Do upvote if you find this helpful in any way. It will help me stay motivated to share many such notebooks and resources. Cheers!!**