### **NLP using TensorFlow**

**1. Introduction to NLP fundamentals**
* NLP aims to derive information from natural language (sequences such as text or speech)
* Many NLP problems are framed as sequence-to-sequence (seq2seq) problems

In [2]:
# DL needs
import tensorflow as tf
import keras as kr

# Data needs
import pandas as pd
from sklearn.model_selection import train_test_split

# Numerical computation needs
import numpy as np

# plotting needs
import matplotlib.pyplot as plt
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

# ensuring reproducibility
random_seed=42
tf.random.set_seed(random_seed)
import sys

sys.path.append('/home/rudraksha14/Desktop/RAY_RISE_ABOVE_YOURSELF/Programming/tensorflow')
import important_functionalities as impf

**Dataset**
* Kaggle's introduction to NLP dataset (text samples of tweets labelled as disaster or not disaster)
* If the dataset is imbalanced, see TensorFlow's guide on handling imbalanced data
  https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

**2. Visualizing a text dataset**

In [3]:
train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
train_df['text'][1]

'Forest fire near La Ronge Sask. Canada'

In [5]:
# shuffle training dataframe
train_df_shuffled=train_df.sample(frac=1,random_state=random_seed) # frac=1 samples 100% of the rows, i.e. shuffles the whole dataframe
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [6]:
# test dataframe
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
# No of examples of each class
train_df_shuffled.target.value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
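
The counts above are mildly imbalanced (roughly 57% vs 43%). As a rough sketch of an inverse-frequency weighting heuristic, similar in spirit to the one in the TensorFlow guide linked earlier, the snippet below derives per-class weights from those two counts. The variable names are illustrative, and whether weighting is needed at all for an imbalance this mild is a judgment call.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights, scaled so that a perfectly balanced
# dataset would give every class a weight of 1.0.
class_weight = {
    label: total / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weight.items():
    print(f"class {label}: weight={weight:.3f}")
```

A dictionary of this shape is what the `class_weight` argument of Keras `model.fit` accepts.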

In [8]:
# total number of samples
len(train_df_shuffled),len(test_df)

(7613, 3263)

In [9]:
# visualize some random samples (the slice below covers samples+1 rows, so 6 tweets are printed)
import random
samples=5
random_index = random.randint(0,len(train_df)-samples)
for row in train_df_shuffled[["text","target"]][random_index:random_index+samples+1].itertuples():
    _,text,target = row # underscore is for index, itertuples always returns index
    print(f'Target: {target}',"(real disaster)" if target > 0 else "(not real disaster)")
    print(f'Text:\n{text}\n')
    print('---\n')

Target: 0 (not real disaster)
Text:
Best windows torrent client? was recommended Deluge but it looks like it was written 10 years ago with java swing and 'uses' worse

---

Target: 1 (real disaster)
Text:
California is battling its scariest 2015 wildfire so far. http://t.co/Lec1vmS7x2

---

Target: 1 (real disaster)
Text:
mentions of 'theatre +shooting' on Twitter spike 30min prior to $ckec collapse http://t.co/uuBOvy9GQI

---

Target: 0 (not real disaster)
Text:
trapped in its disappearance

---

Target: 1 (real disaster)
Text:
#Saudi Arabia: #Abha: Fatalities reported following suicide bombing at mosque; avoid area http://t.co/1xW0Z8ZeqW

---

Target: 0 (not real disaster)
Text:
New Ladies Shoulder Tote Handbag Women Cross Body Bag Faux Leather Fashion Purse - Full reÛ_ http://t.co/3PCNtcZoxv http://t.co/n0AkjM1e4B

---



**3. Splitting data into training and validation set**

In [10]:
from sklearn.model_selection import train_test_split

train_sentences,val_sentences,train_labels,val_labels=train_test_split(
    train_df_shuffled['text'].to_numpy(),
    train_df_shuffled['target'].to_numpy(),
    test_size=0.1, # keep 10% of the labelled data for validation
    random_state=random_seed)

In [11]:
len(train_sentences),len(train_labels),len(val_sentences),len(val_labels)

(6851, 6851, 762, 762)
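
The sizes above follow from `test_size=0.1`: scikit-learn rounds the validation share up to a whole number of samples and gives the remainder to training. A short arithmetic check, assuming that rounding rule:

```python
import math

n_samples = 7613   # rows in train_df_shuffled
test_size = 0.1    # fraction passed to train_test_split

n_val = math.ceil(test_size * n_samples)   # validation samples, rounded up
n_train = n_samples - n_val                # everything else is training data

print(n_train, n_val)  # 6851 762
```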

In [12]:
train_sentences[:10],train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

**4. Converting text to numbers using tokenization**

**Tokenization vs Embedding**  
* Tokenization (direct mapping from word to number)  
    ```
    "I love tensorflow" → 0 1 2  (I → 0, love → 1, tensorflow → 2) [word level tokenization]
                        → [1,0,0] [0,1,0] [0,0,1]
                                 (I → [1,0,0], love → [0,1,0], tensorflow → [0,0,1]) [One-hot encoding]
    ```
* Embedding (every word turned into a vector)
    ``` 
    "I love tensorflow"→ [0.554,0.847,0.887] > I
                         [0.667,0.112,0.215] > love
                         [0.368,0.846,0.961] > tensorflow
    ```
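
As a minimal sketch of the tokenization and one-hot mappings above, the snippet below builds a word-level index for the example sentence and then one-hot encodes it. The names `word_to_index` and `one_hot` are made up for illustration and are not part of TensorFlow.

```python
sentence = "I love tensorflow"
words = sentence.split()

# Word-level tokenization: each distinct word gets an integer id
# in order of first appearance.
word_to_index = {}
for word in words:
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index)
token_ids = [word_to_index[word] for word in words]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

one_hot_vectors = [one_hot(i, len(word_to_index)) for i in token_ids]

print(token_ids)        # [0, 1, 2]
print(one_hot_vectors)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each one-hot vector is as long as the vocabulary, which is why this encoding becomes impractical for large vocabularies.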

* Tokenization (simple and fast to build, but one-hot vectors grow with the vocabulary size)
  * Word level
  * Character level (e.g. mapping the letters A to Z to the values 1 to 26)
  * Sub-word level (in between character and word level)
* Embedding (a richer representation of the relationships between tokens; the vector size can be fixed and the values can be learnt)
  * Can be used as a layer in a neural network

* Which level of tokenization and which embedding should you use?
  * It depends on the problem
  * Pre-trained embeddings such as Word2vec and GloVe are available
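
To make the trade-off between tokenization levels concrete, here is a small comparison of word-level and character-level tokenization on one tweet from the dataset. Character-level tokens make sequences much longer; over a whole corpus the character vocabulary stays small while the word vocabulary keeps growing. The sketch is illustrative and skips the standardization a real tokenizer would apply.

```python
tweet = "Forest fire near La Ronge Sask. Canada"

word_tokens = tweet.split()   # word level: split on whitespace
char_tokens = list(tweet)     # character level: every character is a token

print(f"word level: {len(word_tokens)} tokens")
print(f"char level: {len(char_tokens)} tokens")
```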


```
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=None,  # maximum vocabulary size (including the padding and OOV tokens); None means no cap
    standardize='lower_and_strip_punctuation',  # lower-case the text and remove punctuation
    split='whitespace',  # split tokens on whitespace
    ngrams=None,  # group tokens into n-grams (None --> every token stands alone)
    output_mode='int',  # how to map tokens to numbers ('int' --> one integer id per token)
    output_sequence_length=None,  # None --> each batch is padded to its longest sequence
    pad_to_max_tokens=False  # only used by the 'multi_hot', 'count' and 'tf_idf' output modes
)
```

**creating a tokenizer**

In [13]:
# finding the average number of words/tokens in the training tweets
avg_len=round(sum([len(sentence.split()) for sentence in train_sentences])/len(train_sentences))
print(avg_len)
# setup text vectorization variables
max_vocab_length = 10000 # get the most common 10k words to have in our vocab 
max_len = avg_len # max length of our sequence (eg. how many words from the tweet does our model see?)

# creating the text_vectorizer
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_vocab_length,  # how many words in vocabulary, if None there is no cap on the vocab, it automatically adds <OOV> (Out Of Vocabulary) / Unknown
    output_mode='int',  # how to map tokens/words to numbers
    output_sequence_length=max_len,  # pad or truncate every sequence to max_len tokens
)

15


In [14]:
# mapping the text vectorization layer to text data and turning it into numbers
text_vectorizer.adapt(train_sentences)

In [15]:
# create a sample sentence and tokenize it 
sample_sentence = "There's a flood in an unexplored region!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 264,    3,  232,    4,   39,    1, 1486,    0,    0,    0,    0,
           0,    0,    0,    0]])>
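
The output above can be reproduced by hand. The sketch below approximates what the layer does in `int` mode: lower-case, strip punctuation, split on whitespace, look up each token, then pad with 0 up to the sequence length. The tiny `vocab` dictionary contains only the ids visible in the output above and stands in for the adapted vocabulary; `string.punctuation` is an approximation of the characters Keras strips, and `vectorize` is a hypothetical helper.

```python
import string

# Ids copied from the output above; 0 is padding and 1 is [UNK].
vocab = {"theres": 264, "a": 3, "flood": 232, "in": 4, "an": 39, "region": 1486}
PAD_ID, UNK_ID = 0, 1

def vectorize(text, vocab, sequence_length):
    """Approximate TextVectorization(output_mode='int') for one string."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = stripped.split()
    ids = [vocab.get(token, UNK_ID) for token in tokens]
    ids = ids[:sequence_length]                           # truncate long inputs
    return ids + [PAD_ID] * (sequence_length - len(ids))  # pad short inputs

print(vectorize("There's a flood in an unexplored region!", vocab, 15))
# [264, 3, 232, 4, 39, 1, 1486, 0, 0, 0, 0, 0, 0, 0, 0]
```

The word "unexplored" is not in the vocabulary, so it maps to the `[UNK]` id 1.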

In [16]:
# choose a random sentence from the training dataset and tokenize it
import random
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n\nVectorized version:\n{text_vectorizer([random_sentence])}")

Original text:
It still hasn't sunk in that I've actually met my Idol ????

Vectorized version:
[[  15   80 1202  450    4   16  276  633 5135   13 5393    0    0    0
     0]]


In [17]:
# inspect all words in vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # vocabulary learnt from the training data (capped at max_vocab_length), most frequent first
top_5_words=words_in_vocab[:5] # get most common words
bottom_5_words=words_in_vocab[-5:] # get least common words

print(f"Number of words in vocabulary:{len(words_in_vocab)}")
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocabulary:10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
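
The ordering seen above (padding token, then `[UNK]`, then words by descending frequency) can be imitated with a `Counter`. This is a rough sketch on three made-up sentences, not the layer's actual implementation; in particular, the order of equally frequent words here is an artifact of `Counter.most_common`.

```python
from collections import Counter

# Toy corpus standing in for train_sentences (made up for illustration).
corpus = [
    "the fire is near the town",
    "a flood in the town",
    "the storm is over",
]

max_tokens = 6  # cap on vocabulary size, including the two special tokens

word_counts = Counter(word for sentence in corpus for word in sentence.split())

# Reserve index 0 for padding and index 1 for out-of-vocabulary words,
# then fill the remaining slots with the most frequent words.
vocabulary = ["", "[UNK]"] + [
    word for word, _ in word_counts.most_common(max_tokens - 2)
]

print(vocabulary)
```

Any word that does not make it into the capped vocabulary is mapped to `[UNK]` at lookup time.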


**creating an embedding**
* An embedding is a rich numerical representation of a word that can be learnt during training
* `tf.keras.layers.Embedding()`: turns non-negative integers (indexes) into dense vectors of fixed size
* The vector size is a hyperparameter, so it stays fixed no matter how large the vocabulary grows
* The parameters we care most about for our embedding layer:
  * `input_dim`: size of our vocabulary
  * `output_dim`: size of the output embedding vector (e.g. a value of 128 --> each token is represented by a vector of size 128) [multiples of 8 such as 64, 128 or 256 are a common choice]
  * `input_length`: length of the sequences being passed to the embedding layer (deprecated in Keras 3, where it can be omitted)
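
Conceptually, an embedding layer is a trainable lookup table with `input_dim` rows and `output_dim` columns. The NumPy sketch below mimics only the lookup step (no training) with a randomly initialised table; the token ids are the vectorized sample sentence from earlier in this section, and the initial value range is an assumption.

```python
import numpy as np

input_dim = 10000   # vocabulary size (max_vocab_length)
output_dim = 128    # length of each embedding vector

rng = np.random.default_rng(42)
# One row per token id, initialised with small uniform random values.
embedding_table = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# A batch containing one padded sequence of 15 token ids.
token_ids = np.array([[264, 3, 232, 4, 39, 1, 1486, 0, 0, 0, 0, 0, 0, 0, 0]])

# Looking up a batch of ids is plain integer indexing into the table.
embedded = embedding_table[token_ids]

print(embedded.shape)        # (1, 15, 128)
print(embedding_table.size)  # 1280000 values in the table
```

Every padding position (id 0) picks up the same row, which is why the trailing vectors of an embedded, padded sentence are identical.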

In [18]:
embedding = tf.keras.layers.Embedding(input_dim = max_vocab_length,
                                      output_dim = 128,
                                      input_length = max_len,
                                      # embeddings_initializer = 'uniform' # default: uniform random numbers
                                    )
embedding



<Embedding name=embedding, built=False>

In [19]:
# choose a random sentence from the training dataset, tokenize it and embed it
import random
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n\nVectorized version:\n{text_vectorizer([random_sentence])}\n\nEmbedded sentence:\n{embedding(text_vectorizer([random_sentence]))}")

Original text:
Growth dries up for BHP Billiton as oil price collapse bites http://t.co/HQoD6v6DnC

Vectorized version:
[[3800    1   27   10    1    1   26  254 1791  155 6095    1    0    0
     0]]

Embedded sentence:
[[[-0.00091185 -0.04793524 -0.04358045 ... -0.00187146  0.02569837
    0.00744072]
  [ 0.02472813  0.03289379 -0.00931381 ... -0.02643679 -0.00477345
    0.04082498]
  [-0.04980818  0.04183486  0.00373161 ... -0.03559046 -0.04574704
   -0.00838652]
  ...
  [-0.0075115  -0.02391864  0.03952919 ...  0.03426281  0.00336608
   -0.02466766]
  [-0.0075115  -0.02391864  0.03952919 ...  0.03426281  0.00336608
   -0.02466766]
  [-0.0075115  -0.02391864  0.03952919 ...  0.03426281  0.00336608
   -0.02466766]]]


In [20]:
# embed the whole sentence: output shape is (batch, sequence length, embedding dim)
sample_embed=embedding(text_vectorizer([random_sentence]))
sample_embed

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.00091185, -0.04793524, -0.04358045, ..., -0.00187146,
          0.02569837,  0.00744072],
        [ 0.02472813,  0.03289379, -0.00931381, ..., -0.02643679,
         -0.00477345,  0.04082498],
        [-0.04980818,  0.04183486,  0.00373161, ..., -0.03559046,
         -0.04574704, -0.00838652],
        ...,
        [-0.0075115 , -0.02391864,  0.03952919, ...,  0.03426281,
          0.00336608, -0.02466766],
        [-0.0075115 , -0.02391864,  0.03952919, ...,  0.03426281,
          0.00336608, -0.02466766],
        [-0.0075115 , -0.02391864,  0.03952919, ...,  0.03426281,
          0.00336608, -0.02466766]]], dtype=float32)>

In [21]:
sample_embed[0][0],sample_embed[0][0].shape,random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.00091185, -0.04793524, -0.04358045,  0.01630894,  0.02740479,
         0.0020316 ,  0.01540469,  0.0362174 , -0.0315143 ,  0.00382898,
        -0.04388655, -0.04705495, -0.04807838, -0.04509239, -0.00234828,
        -0.00650933, -0.01198564,  0.01280477, -0.0239887 , -0.01322567,
         0.03326421,  0.0443283 , -0.03874247,  0.03688614, -0.02234178,
        -0.04203767, -0.0065285 ,  0.03498483,  0.03021096, -0.04685717,
         0.03886629,  0.02953703,  0.03006404,  0.03412286,  0.03662144,
        -0.00015093, -0.01965293,  0.01049066,  0.00700744, -0.03514178,
         0.01366471, -0.04760374,  0.02964156,  0.00788599, -0.03370781,
         0.01256331,  0.0070158 , -0.02092465, -0.04301707, -0.035767  ,
         0.01395755, -0.00969609, -0.04300268, -0.02178118, -0.04004983,
        -0.04253005,  0.00464932, -0.01507258, -0.0363205 , -0.04424253,
        -0.0140479 , -0.03138515, -0.0354749 ,  0.00987401, -0.03429439,
  

**5. Modelling a text-dataset (running series of experiments)**
<br>
|Experiment No.|Model|
|---|---|
|0|Naive Bayes with TF-IDF encoder (baseline)|
|1|Feed forward Neural Network (dense model)|
|2|LSTM (RNN)|
|3|GRU (RNN) |
|4|Bidirectional LSTM (RNN)|
|5|1D convolutional Neural Network|
|6|TensorFlow Hub pre-trained feature extractor|
|7|TensorFlow Hub pre-trained feature extractor (10% of data)|

* TF-IDF (Term Frequency-Inverse Document Frequency): a numerical statistic used in information retrieval and NLP to measure how important a word is to a document within a collection (corpus)

* LSTM (Long Short-Term Memory): a recurrent layer that uses gates to keep or forget information over long sequences

* GRU (Gated Recurrent Unit): a gated recurrent layer similar to the LSTM but with fewer parameters
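
As a rough illustration of the TF-IDF idea, the sketch below scores the words of one document using raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + N) / (1 + df)) + 1`. This is one common variant (scikit-learn documents a similar smoothed form and additionally normalises each row); the three documents are made up and the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

docs = [
    "forest fire near the town",
    "the flood hit the town",
    "the concert was fire",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: term count * smoothed idf} for one tokenized document."""
    term_freq = Counter(tokens)
    return {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in term_freq.items()
    }

scores = tfidf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:>8}: {score:.3f}")
```

Words that occur in every document, such as "the", get the lowest weight, while words unique to one document get the highest.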

**Note:**
* It is important to create a baseline model so that we have a benchmark for future experiments to improve upon
* Here we use scikit-learn's multinomial Naive Bayes, with the TF-IDF formula converting our words to numbers
* It is common practice to use a non-DL algorithm as a baseline because of its speed, and then use DL to see whether you can improve on it
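
An even simpler reference point is the majority-class baseline: always predict the most frequent label. The sketch below uses the class counts of the full labelled set shown earlier; the validation split will differ slightly, so treat the number as approximate.

```python
# Class counts of the full labelled dataset (from value_counts()).
class_counts = {0: 4342, 1: 3271}

majority_label = max(class_counts, key=class_counts.get)
majority_accuracy = class_counts[majority_label] / sum(class_counts.values())

print(f"Always predicting {majority_label} gives {majority_accuracy * 100:.2f}% accuracy")
```

Any trained model, including the Naive Bayes baseline, should be judged against this floor of roughly 57%.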

**Modelling steps:**
* Create a model
* Build a model
* Fit a model
* Evaluate our model

**6. Model 0 [Naive Bayes with TF-IDF encoder (baseline)]: creation and evaluation**

In [24]:
### Model 0: Getting a baseline 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf",TfidfVectorizer()), # convert words to numbers using tfidf
    ("clf", MultinomialNB()) # model the text (clf: classifier)
])


# fitting the pipeline to the training data
model_0.fit(train_sentences,train_labels)
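
To show what the classifier step does under the hood, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It works on raw word counts rather than TF-IDF weights and on four made-up tweets, so it illustrates the idea rather than reproducing `model_0`; `train_nb` and `predict_nb` are hypothetical helper names.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenized documents."""
    classes = sorted(set(labels))
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())

    class_counts = Counter(labels)
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    log_likelihood = {}
    for c in classes:
        # Add-alpha smoothing keeps unseen words from zeroing out a class.
        denominator = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + alpha) / denominator)
            for word in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(
            log_likelihood[c][word]
            for word in doc.lower().split()
            if word in log_likelihood[c]   # ignore words never seen in training
        )
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = ["forest fire near town", "earthquake destroys homes",
        "great day at school", "love this new song"]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, "forest fire"))    # 1
print(predict_nb(model, "love this day"))  # 0
```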

In [25]:
# evaluate our baseline model (default evaluation metric: accuracy)
baseline_score  = model_0.score(val_sentences,val_labels)
print(f"Our baseline model achieves an accuracy of {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of 79.27%


In [26]:
# make predictions
baseline_preds=model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

**7. Creating an evaluation metric function**
<br>
<br>
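*Input:* 
* y_true (true labels, e.g. val_labels)
* y_pred (model predictions on the validation data)

<br>

*Returns*:
* accuracy (as a percentage)
* precision (weighted average, between 0 and 1)
* recall (weighted average, between 0 and 1)
* f1-score (weighted average, between 0 and 1)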

In [28]:
from sklearn.metrics import accuracy_score,precision_recall_fscore_support

def calculate_results(y_true,y_pred):
    '''
    Calculates accuracy (in %) and weighted-average precision, recall and f1_score (0 to 1)
    for a binary classification model
    '''

    # model accuracy, scaled to a percentage
    model_accuracy = accuracy_score(y_true=y_true,y_pred=y_pred)*100

    # model precision, recall, and f1-score using weighted average (the support value is not needed)
    model_precision,model_recall,model_f1_score,_ = precision_recall_fscore_support(y_true=y_true,y_pred=y_pred,average='weighted')

    model_results={
        'accuracy': model_accuracy,
        'precision': model_precision,
        'recall': model_recall,
        'f1_score': model_f1_score
    }
    return model_results

# get baseline results:
baseline_results=calculate_results(val_labels,baseline_preds)
baseline_results



{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1_score': 0.7862189758049549}
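
For intuition about what `average='weighted'` does, the sketch below computes per-class precision, recall and F1 by hand on eight made-up labels and then averages them weighted by each class's support. The function name `weighted_scores` is hypothetical, and the toy labels are not taken from the dataset.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes."""
    totals = {"precision": 0.0, "recall": 0.0, "f1_score": 0.0}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = (tp + fn) / len(y_true)   # class support as a fraction
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1_score"] += weight * f1
    return totals

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

results = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(results, accuracy)
```

Support-weighted recall always equals plain accuracy, which is why `recall` in the baseline results above matches `accuracy` divided by 100.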

***-- CONTD IN NEXT NOTEBOOK --***