# But, what is NLP?

NLP is a branch of AI that deals with the interaction between computers and humans through natural language. Most NLP techniques rely on Machine Learning to make sense of natural language. 
The ultimate objective of NLP is to read, decipher, understand and make sense of human languages.

Typical NLP use cases - 
* Voice assistants: Ok Google, Siri, Cortana
* Language translation: Google Translate, Bing
* Gmail's spam detection filter
* MS Word's grammar checker

There are 2 broad approaches to handling NLP problems - Syntactic and Semantic Analysis

1. Syntax - In code, syntax means the rules that need to be followed for a particular programming language. In a natural language, syntax means the grammatical rules that need to be followed.
      Examples: POS tagging, lemmatization, stemming etc.

2. Semantics - Semantic analysis dives deeper into the language to understand the meaning conveyed by the text and sentence structure.
      Examples: NER, NLU etc.


Above is a short summary of this [Article](https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32)

From a machine learning perspective, NLP is about being as creative as we can get with converting our text to numbers. Why is that? Because ML models don't take text as input. In this notebook, we will use the following approach to break down an NLP problem and derive insights from it - 
1. Get the text
2. Convert it into numbers
3. Modelling
4. Predictions


We will start with a baseline TF-IDF model and build on it using DNN, LSTM, GRU, Conv and Transfer learning.

In [None]:
# Import libraries
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
import random
from tensorflow.keras import layers
import datetime

# Let's define Universal Random State
random_state = 42

# For TensorBoard, let's define log storage directory
SAVE_DIR = 'model_logs'

#### The below code block can be used to download data from kaggle directly into Colab

In [None]:
## Data - We will use Kaggle's disaster tweets dataset

# Natural Language Processing with Disaster Tweets

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/3a/e7/3bac01547d2ed3d308ac92a0878fbdb0ed0f3d41fb1906c319ccbba1bfbc/kaggle-1.5.12.tar.gz (58kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-cp37-none-any.whl size=73053 sha256=865a5811a263e222fb5f4042ce3a226b3ca3f6ff01f91093213a2a142cdbca52
  Stored in directory: /root/.cache/pip/wheels/a1/6a/26/d30b7499ff85a4a4593377a87ecf55f7d08af42f0de9b60303
Successfully built kaggle
Installing collected package

In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"suvigyajain","key":"REDACTED"}'}

In [None]:
! mkdir ~/.kaggle


In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle competitions download -c 'nlp-getting-started'

Downloading nlp-getting-started.zip to /content
  0% 0.00/593k [00:00<?, ?B/s]
100% 593k/593k [00:00<00:00, 110MB/s]


In [None]:
! mkdir input_data


In [None]:
! unzip nlp-getting-started.zip -d input_data

Archive:  nlp-getting-started.zip
  inflating: input_data/sample_submission.csv  
  inflating: input_data/test.csv     
  inflating: input_data/train.csv    


### Let's deep dive in the data

Kaggle created 3 files for us - train, test and sample-submission
We will read the files, do some EDA and then move forward

In [None]:
train_df = pd.read_csv('input_data/train.csv')
test_df = pd.read_csv('input_data/test.csv')

print("Train Data Size : ", len(train_df))
print("Test Data Size : ", len(test_df))

Train Data Size :  7613
Test Data Size :  3263


In [None]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [None]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [None]:
# Data desc on Kaggle mentions that the location has this distribution - 
                  # USA -> 1%
                  # RoW -> 65%
                  # Null-> 33%

In [None]:
# What about keywords?
train_df[train_df['keyword'].notnull()].head()

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
32,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
33,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
34,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
35,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0


In [None]:
# Are key words always part of the text? If so, then can we just ignore those?
train_df[train_df['keyword'].notnull()].apply(lambda x: x.keyword in x.text, axis=1)

31       True
32      False
33       True
34       True
35      False
        ...  
7578     True
7579     True
7580     True
7581     True
7582     True
Length: 7552, dtype: bool

In [None]:
# Not always true!! Let's see in the later stages if we can somehow use this added information
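One caveat on the check above: it is case-sensitive and compares the raw keyword string. Below is a minimal sketch of a more forgiving variant, which lowercases both sides and decodes URL-style escapes. The helper name `keyword_in_text` and the sample rows are made up for illustration, and the `%20` encoding of multi-word keywords is an assumption, not something shown in the output above.

```python
from urllib.parse import unquote

def keyword_in_text(keyword, text):
    """Case-insensitive containment check after decoding URL escapes (e.g. %20 -> space)."""
    cleaned = unquote(keyword).lower()
    return cleaned in text.lower()

# Hypothetical rows, not taken from the dataset
samples = [
    ("ablaze", "Wholesale Markets ABLAZE"),
    ("forest%20fire", "There is a forest fire near the lake"),
    ("ablaze", "We always try to bring the heavy"),
]
matches = [keyword_in_text(k, t) for k, t in samples]
print(matches)
```

On the real data, the same helper could be passed to `apply(..., axis=1)` in place of the lambda, and `.mean()` on the result would give the share of tweets that contain their keyword.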

In [None]:
train_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

Another Binary Classification Problem. Fairly balanced target (roughly 57-43)

  1 - Disaster Related Tweet
  
  0 - Not related to Disaster
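For reference, the class shares can be worked out directly from the `value_counts()` output. This is just arithmetic on the two counts printed above; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)
```

This also gives a sanity floor for later models: always predicting the majority class would be right on about 57% of the training tweets.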

In [None]:
# Let's print out few 1s and 0s

positive_tweet_sample = train_df[train_df['target'] == 1]['text'].head().tolist()
negative_tweet_sample = train_df[train_df['target'] == 0]['text'].head().tolist()


print("Disaster Related Tweets : ", '\n')
for i in range(5):
  print('\t', positive_tweet_sample[i])
  print('\t', '-'*20)
print('\n')
print("Non-Disaster Related Tweets : ", '\n')
for i in range(5):
  print('\t', negative_tweet_sample[i])
  print('\t', '-'*20)

Disaster Related Tweets :  

	 Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
	 --------------------
	 Forest fire near La Ronge Sask. Canada
	 --------------------
	 All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
	 --------------------
	 13,000 people receive #wildfires evacuation orders in California 
	 --------------------
	 Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
	 --------------------


Non-Disaster Related Tweets :  

	 What's up man?
	 --------------------
	 I love fruits
	 --------------------
	 Summer is lovely
	 --------------------
	 My car is so fast
	 --------------------
	 What a goooooooaaaaaal!!!!!!
	 --------------------


Train-Test Split (Or Train Validation Split)

Since the test data doesn't have any labels, we will have to split our training data into train and validation (or test) sets. How about 90-10 for starters?

In [None]:
# Define test_size
test_size = 0.1

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df['text'].to_numpy(), train_df['target'].to_numpy(),
                                                                            test_size = test_size,
                                                                            random_state = random_state)

In [None]:
# Size of training and validation sets
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)

## Convert text data into numbers

The next step is to convert string data into numbers. Our labels are already numerical, so there is no need for a LabelEncoder, but the tweet text surely needs converting.

There are 2 major approaches to converting text into numbers - 

* Tokenization - A mapping, or a lookup, from a word/sub-word/character to a number. We can do the following types of tokenization - 

      1. Word-Level Tokenization : Assign every word in the sequence its own token. For example, in the sentence 'I Love Pizza', we can assign I:0, Love:1, Pizza:2
      2. Character-Level Tokenization : Assign a token to every character in the corpus. 
      3. Sub-word-Level Tokenization : This involves breaking words into parts and assigning tokens to each part. 
                    For example, 'I love Pineapple' can be broken
                    into 'I lo ve pine app le', with a token assigned
                    to each of these sub-words. With sub-word
                    tokenization, a word might have multiple tokens

* Embeddings - An embedding is a representation of natural language which can be learned. This representation comes in the form of a feature vector. For example, the word 'Football' can be represented by a 5-d feature vector : [-0.3456, 0.2352, 0.3454, 0.2576, 0.9865]. Importantly, the size of the feature vector is tunable. We can use embeddings in 2 ways : 

      1. Create your own embedding : Once your text has been converted to
                      numbers (required for Embedding), you can pass it
                      to a Keras Embedding layer and an embedding
                      representation will be learned during model training
      2. Use a pre-learned embedding : Transfer Learning. 
                      The power of Deep Learning. You can use pre-created
                      embedding layers and fine-tune them for your
                      own purpose. The benefit here is that a model such as
                      BERT has already been pre-trained on a huge corpus
                      that includes English Wikipedia. It is not practical
                      to train for months on such a corpus every time we
                      perform a sentiment analysis. We instead reuse the
                      results of the previous training and adapt them
                      to our purpose.
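The two ideas above can be sketched in plain Python. This is only an illustration: the sentences, the variable names and the randomly initialised 5-d vectors are all made up, and a real embedding would be learned during training rather than left random.

```python
import random

sentences = ["I love pizza", "I love football"]

# Word-level tokenization: one integer id per distinct (lowercased) word
word_index = {}
for sentence in sentences:
    for word in sentence.lower().split():
        word_index.setdefault(word, len(word_index))
word_tokens = [[word_index[w] for w in s.lower().split()] for s in sentences]

# Character-level tokenization: one integer id per distinct character
chars = sorted(set("".join(sentences).lower()))
char_index = {c: i for i, c in enumerate(chars)}
char_tokens = [char_index[c] for c in sentences[0].lower()]

# Embedding: a lookup table holding one vector per token id
rng = random.Random(42)
embedding_dim = 5
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in word_index]
embedded = [embedding_table[t] for t in word_tokens[0]]

print(word_index)
print(word_tokens)
print(len(char_tokens), len(embedded), len(embedded[0]))
```

Note how the shared words 'i' and 'love' get the same ids in both sentences, and how character-level tokenization turns a 3-word sentence into a much longer sequence.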


To tokenize our words, we'll use the helpful preprocessing layer tf.keras.layers.experimental.preprocessing.TextVectorization.

The TextVectorization layer takes the following parameters:

* max_tokens - The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text), includes a value for OOV (out of vocabulary) tokens.
* standardize - Method for standardizing text. Default is "lower_and_strip_punctuation" which lowers text and removes all punctuation marks.
* split - How to split text, default is "whitespace" which splits on spaces.
* ngrams - Which n-grams to create from the split tokens. For example, ngrams=2 produces single words as well as contiguous sequences of 2 words.
* output_mode - How to output tokens, can be "int" (integer mapping), "binary" (multi-hot encoding: one slot per vocabulary entry, set to 1 if that token occurs in the text), "count" or "tf-idf". See documentation for more.
* output_sequence_length - Length of tokenized sequence to output. For example, if output_sequence_length=150, all tokenized sequences will be 150 tokens long.
* pad_to_max_tokens - If True (default), the output feature axis will be padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens.
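To make these parameters concrete, here is a rough pure-Python imitation of what the layer does in "int" mode. It is a sketch under assumptions, not the layer's actual implementation: the function names are hypothetical, only ASCII punctuation is stripped, and ties between equally frequent words may be ordered differently by the real layer.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus, max_tokens):
    """Index 0 is padding, index 1 is the OOV token, then words by descending frequency."""
    counts = Counter(w for doc in corpus for w in standardize(doc).split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable for ties
    return ["", "[UNK]"] + ranked[:max_tokens - 2]

def vectorize(text, vocab, output_sequence_length):
    """Map words to ids, then truncate or zero-pad to a fixed length."""
    lookup = {w: i for i, w in enumerate(vocab)}
    ids = [lookup.get(w, 1) for w in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

corpus = ["I love pizza!", "I love football.", "The dog loves cricket"]
vocab = build_vocab(corpus, max_tokens=6)
padded = vectorize("I love cricket, really", vocab, output_sequence_length=6)
truncated = vectorize("I love pizza and I love football too", vocab, output_sequence_length=5)
print(vocab)
print(padded)
print(truncated)
```

With `max_tokens=6` only four real words fit in the vocabulary, so 'cricket' falls back to the OOV id 1 even though it appeared in the corpus.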



In [None]:
# What's the number of distinct words in the corpus?
from collections import Counter
results = Counter()
train_df['text'].str.lower().str.split().apply(results.update)
print(len(results.keys()))

27983


In [None]:
# Close to 30K if we include test data as well. We'll choose 10K as the vocab_size (or max_tokens)

In [None]:
# Average tweet length?
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [None]:
# We will choose output_sequence_length as 15, since Average Tweet Length is 15
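Worth keeping in mind: using the average as the sequence length means that longer-than-average tweets get truncated. Here is a small sketch for checking how many sentences fit within a given length. The helper name `share_fitting` and the mini-corpus are made up; in the notebook you would pass `train_sentences` instead.

```python
def share_fitting(sentences, max_len):
    """Fraction of sentences with at most max_len whitespace-separated tokens."""
    fits = sum(1 for s in sentences if len(s.split()) <= max_len)
    return fits / len(sentences)

# Hypothetical mini-corpus; in the notebook this would be train_sentences
toy_sentences = [
    "Forest fire near La Ronge",
    "What's up man?",
    "13,000 people receive evacuation orders in California after the wildfires spread overnight",
    "I love fruits",
]
coverage = {max_len: share_fitting(toy_sentences, max_len) for max_len in (3, 5, 12)}
print(coverage)
```

If the share at 15 turned out to be low on the real data, a higher percentile of the length distribution might be a safer choice than the mean.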

In [None]:
# Initialize Text Vectorizer
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vocab_size = 10000
output_len = 15

text_vectorizer = TextVectorization(max_tokens=vocab_size,
                                    output_sequence_length = output_len,
                                    output_mode = 'int')

# Adapt the vectorizer created above to our training data (adapt is like fit: it learns the vocabulary)
text_vectorizer.adapt(train_sentences)

In [None]:
# Let's see how this vectorizer is working on few sample text

sample_corpus = [['I love Pizza'],
                 ['I love Football'],
                 ['The dog loves cricket']]
text_vectorizer(sample_corpus)

<tf.Tensor: shape=(3, 15), dtype=int64, numpy=
array([[   8,  107, 3526,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [   8,  107, 1528,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [   2, 1014, 2401, 3964,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]])>

In [None]:
# So, every output is of length 15. Same word gets same token

In [None]:
# Explore the vectorizer a bit more
words_in_vocab = text_vectorizer.get_vocabulary()
top5_words = words_in_vocab[:5]
bottom5_words = words_in_vocab[-5:]

print("Total numbers of words in the vocab : ", len(words_in_vocab))
print("Top 5 most popular words in the vocab : ", top5_words)
print("Top 5 least popular words in the vocab : ", bottom5_words)

Total numbers of words in the vocab :  10000
Top 5 most popular words in the vocab :  ['', '[UNK]', 'the', 'a', 'in']
Top 5 least popular words in the vocab :  ['pakthey', 'pakistan\x89Ûªs', 'pakistans', 'pajamas', 'paints']


### Create an embedding using Keras Embedding Layer

Next, we convert the tokens to an embedding. One major advantage is that an embedding layer is trainable, so the vector for each token we obtained from the text vectorizer gets updated during training.

Main parameters we are looking for - 
1. input_dim = Size of the vocabulary (vocab_size in this context)
2. input_length = Length of input sequences being passed to the embedding (15 here)
3. output_dim = Size of each embedding vector. If 100, we get output of shape (m, input_length, 100), where m is the number of examples in the batch
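How these three numbers shape the output can be sketched with a plain lookup table. This is a toy stand-in for the Keras layer, with made-up sizes and hypothetical helper names; the small random initial values are just an assumption for the demo.

```python
import random

def make_embedding(input_dim, output_dim, seed=0):
    """One row of output_dim small random values per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]

def embed(batch, table):
    """Replace every token id in the batch with its row from the table."""
    return [[table[token] for token in sequence] for sequence in batch]

table = make_embedding(input_dim=20, output_dim=4)
batch = [[8, 7, 3, 0, 0], [2, 14, 1, 19, 0]]  # m = 2 sequences, input_length = 5
out = embed(batch, table)

shape = (len(out), len(out[0]), len(out[0][0]))
n_params = len(table) * len(table[0])
print(shape, n_params)
```

The number of trainable weights is simply `input_dim * output_dim`, and every padding token (id 0) maps to the same row, so padded positions carry identical vectors.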

In [None]:
from tensorflow.keras import layers

embedding_output = 128

embedding = layers.Embedding(input_dim = vocab_size,
                             output_dim = embedding_output,
                             input_length = output_len)

In [None]:
# Let's see what exactly this 'thing' is doing
random_sentence = random.choice(train_sentences)
print("Original Text : ", random_sentence)

# Let's embed this random sentence


print("\n\nEmbedded version")
embed_random_sentence = embedding(text_vectorizer([random_sentence]))
embed_random_sentence

Original Text :  http://t.co/wspuXOrEWb  Cindy Noonan@CindyNoonan-Heartbreak in #Baltimore #Rioting #YAHIstorical #UndergroundRailraod


Embedded version


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-8.2419403e-03,  7.0271604e-03, -2.7804350e-02, ...,
         -6.9104135e-05,  3.2737557e-02, -4.0317081e-02],
        [ 3.6743175e-02, -4.0162910e-02,  1.8032704e-02, ...,
         -2.7377367e-02, -4.0613532e-02, -3.3137396e-02],
        [ 2.8106455e-02, -2.7849562e-03,  2.3721445e-02, ...,
         -2.4215544e-02,  6.0081258e-03, -2.2863984e-02],
        ...,
        [ 4.6496391e-03,  4.3651614e-02, -3.4380961e-02, ...,
         -4.9802579e-02,  2.5700960e-02, -3.5095192e-02],
        [ 4.6496391e-03,  4.3651614e-02, -3.4380961e-02, ...,
         -4.9802579e-02,  2.5700960e-02, -3.5095192e-02],
        [ 4.6496391e-03,  4.3651614e-02, -3.4380961e-02, ...,
         -4.9802579e-02,  2.5700960e-02, -3.5095192e-02]]], dtype=float32)>

In [None]:
# Check out a single token's embedding
embed_random_sentence[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-8.2419403e-03,  7.0271604e-03, -2.7804350e-02,  3.5487283e-02,
        1.3436321e-02,  8.7100267e-04,  9.4576105e-03, -3.1351317e-02,
       -8.3330385e-03,  4.4744160e-02,  2.2176806e-02,  3.9367106e-02,
        2.7689431e-02, -1.4761485e-02, -3.1062020e-02,  3.8930599e-02,
        2.9064763e-02,  1.5715089e-02,  2.3779299e-02,  1.7153110e-02,
       -2.5115967e-02, -8.3158165e-04,  4.8560623e-02,  1.7800417e-02,
        5.3462274e-03,  2.3890410e-02,  1.7804276e-02, -3.5708569e-02,
        1.3284270e-02, -2.6998771e-02,  3.2056320e-02, -4.0175617e-02,
        3.2739725e-02, -5.3801313e-03, -4.1618429e-02,  2.5660362e-02,
        8.5119121e-03,  8.4494725e-03,  8.4378719e-03, -2.1103108e-02,
       -2.9931903e-02,  4.7670010e-02,  1.6261790e-02, -2.0758403e-02,
        2.5988366e-02, -2.3894906e-02, -8.9197978e-03, -4.0400088e-02,
       -3.9870657e-02, -4.9634185e-02, -4.3578278e-02, -1.7457653e-02,
       -4.0914666e-02, -2.049

In [None]:
# So, every token (or a word if we use word-level tokenization) gets mapped into a 128 dimensional space

# What a beautiful matrix. LOL!!

## Modelling

We'll be building the following:

* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model
* Model 3: GRU model
* Model 4: Bidirectional-LSTM model
* Model 5: 1D Convolutional Neural Network
* Model 6: TensorFlow Hub Pretrained Feature Extractor
* Model 7: Same as model 6 with 10% of training data

In [None]:
# Since we will be doing a lot of experimentation, it's a good idea to create some base functions
# Following function evaluates: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results

def create_tensorboard_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.
  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"
  Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=log_dir
  )
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tensorboard_callback
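To see what the "weighted" average in `calculate_results` means, here is a hand-rolled sketch: compute precision, recall and F1 per class, then average them weighted by how many true examples each class has. The function name `weighted_prf` and the toy labels are made up, and this is meant to mirror the idea rather than scikit-learn's exact code.

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall and F1."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = (tp + fn) / total  # share of true examples in this class
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1

y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
w_precision, w_recall, w_f1 = weighted_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(w_precision, w_recall, w_f1, accuracy)
```

A side effect of this weighting is that weighted recall always equals plain accuracy, which is why those two numbers coincide in the baseline results further down.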

### Model 0 - Baseline NB Model

Just like TF, Scikit Learn models also don't take strings as inputs (DUHH!!)

So, we will use the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) score of each word to convert text to numbers

Also, we will use a Sklearn Pipeline to get the results

But before that, a bit about TF-IDF. TF-IDF evaluates how relevant a word is to a document in a collection of documents. It has 2 terms - 

1. Term Frequency (TF) - This measures the frequency of a word in a doc. We normalise the count by the total number of words in the doc.

    > TF = Freq of word W in Doc D / Total number of words in D

2. Inverse Document Frequency (IDF) - This measures how informative a word is across the corpus: words that occur in many docs get a low score. We are only interested in the number of docs in which the word W is present and NOT IN ITS FREQUENCY.

    > IDF Definition = Inverse of (Number of docs in which word W occurs / Total number of Docs in Corpus). However, if the number of docs in which W occurs is zero, we would divide by zero
    
So, IDF is defined as - 

  > IDF = log(N / (df + 1)), where N is the total number of docs and df is the number of docs in which W occurs

So, putting it all together, TF-IDF score for a word W is -
  
    TFIDF = tf(w, d) * log(N / (df + 1))
            where w : word for which TFIDF is being calculated
                  d : current document
                 df : number of docs in which w occurs
                  N : Total number of docs
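The formula above can be worked through by hand with a few lines of Python. This is a sketch of the definition as written here, using a made-up four-document corpus and hypothetical function names.

```python
import math

def tf(word, doc_tokens):
    """Occurrences of word in the doc, divided by the doc length."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus_tokens):
    """log(N / (df + 1)), with df = number of docs containing the word."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(n_docs / (df + 1))

def tfidf(word, doc_tokens, corpus_tokens):
    return tf(word, doc_tokens) * idf(word, corpus_tokens)

# Hypothetical corpus, already lowercased
corpus = [
    "forest fire near the lake",
    "the pizza is great",
    "the match was great",
    "fire fire everywhere",
]
docs = [doc.split() for doc in corpus]

score_fire = tfidf("fire", docs[3], docs)    # frequent in this doc, present in 2 of 4 docs
score_the = tfidf("the", docs[0], docs)      # present in 3 of 4 docs
score_pizza = tfidf("pizza", docs[1], docs)  # present in 1 doc only
print(score_fire, score_the, score_pizza)
```

A common word like 'the' ends up with a score of zero here, since log(4 / 4) = 0. Note that scikit-learn's `TfidfVectorizer`, used in the next cell, applies a differently smoothed IDF and normalises each document vector by default, so its numbers will not match this sketch exactly.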




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and MNB pipeline
model_0 = Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('clf', MultinomialNB())
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

# Make predictions
baseline_preds = model_0.predict(val_sentences)
print("Few predictions : ", baseline_preds[:10])

# Evaluate the model on the validation data
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

# Evaluation metrics
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

Few predictions :  [0 0 0 0 0 0 0 0 0 1]
Our baseline model achieves an accuracy of: 77.82%


{'accuracy': 77.82152230971128,
 'f1': 0.7703527809038113,
 'precision': 0.792992256322435,
 'recall': 0.7782152230971129}

### Model 1 - A simple Dense Model

Too simple I'd say. Basically we take the text as input, tokenize it, create an embedding, reduce the embedding to a lower dimension (by averaging over the tokens) and then pass it to 1 fully connected layer

In [None]:
# We have 1-d input. Tweet as a string. 
inputs = layers.Input(shape=(1,), dtype = 'string')
# Tokenize the text
x = text_vectorizer(inputs)
# Create embeddings using embedding layer created above
x = embedding(x)
# Convert to lower dimension, using Average
x = layers.GlobalAveragePooling1D()(x)

# Create outputs - since the output is Binary, we'll use Sigmoid
outputs = layers.Dense(1, activation = 'sigmoid')(x)

# Create the model
model_1 = tf.keras.Model(inputs, outputs, name = 'model_1_dense')

# Compile the model
model_1.compile(loss = 'binary_crossentropy',
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ['accuracy'])

# Get a short summary of the model
model_1.summary()

# Fit the model
model_1_history = model_1.fit(train_sentences,
                              train_labels,
                              epochs = 5,
                              validation_data = (val_sentences, val_labels),
                              callbacks = [create_tensorboard_callback(dir_name = SAVE_DIR,
                                                                       experiment_name = 'Simple_Dense_Model')])

# Evaluate the model
model_1.evaluate(val_sentences, val_labels)

Model: "model_1_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
global_average_pooling1d_2 ( (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________
Saving TensorBoard log files to: model_logs/Simple_Dense_Model/20210620-152005
Epoch 1/5
Epoch 2/5
Epoch 3/5


[0.5013892650604248, 0.7847769260406494]
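To unpack what the last two layers of this model compute, here is a tiny hand-worked sketch: average the token embeddings over the sequence axis, then apply a single sigmoid unit. The 2-d embeddings, weights and function names are made up for illustration. Only the parameter counts at the end come from the summary above.

```python
import math

def global_average_pool(sequence_vectors):
    """Average a (sequence_length, embedding_dim) list of vectors over the sequence axis."""
    length = len(sequence_vectors)
    dim = len(sequence_vectors[0])
    return [sum(vec[j] for vec in sequence_vectors) / length for j in range(dim)]

def dense_sigmoid(features, weights, bias):
    """One dense unit followed by a sigmoid, giving a probability in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tweet of 3 tokens with 2-d embeddings
embedded_tweet = [[0.2, -0.4], [0.6, 0.0], [0.1, 0.1]]
pooled = global_average_pool(embedded_tweet)
probability = dense_sigmoid(pooled, weights=[1.0, -1.0], bias=0.0)
print(pooled, probability)

# Parameter counts matching the model summary
embedding_params = 10000 * 128  # vocab_size * embedding_output
dense_params = 128 * 1 + 1      # one weight per pooled feature, plus a bias
print(embedding_params + dense_params)
```

Averaging collapses the (15, 128) embedding of each tweet into a single 128-d vector, which is why the pooling layer adds no parameters and the dense layer only needs 129.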