<a href="https://colab.research.google.com/github/sowmyarajesh/ML-NLP/blob/main/NLP_SarcasticHeadline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Tokenization and sequencing

The first step is to tokenize the training text with the Tokenizer class. The sentences are then converted to sequences of integer tokens.

**Parameters:**

- num_words: maximum number of words to keep, ranked by frequency; only the most common num_words-1 words are used when encoding. *Default => None, which considers all the unique words in the data.*

- filters: a string where each element is a character that will be filtered from the texts. *Default => all punctuation, tabs and line breaks, except the ' character.*

- lower: whether to convert the texts to lowercase. *Default => True*

- split: a string used as the separator for word splitting. *Default => ' ' (a single space)*

- char_level: if True, every character in the data is treated as a token. *Default => False*

- oov_token: string representing out-of-vocabulary words. If given, it is added to word_index and replaces unknown words when texts are converted to sequences. *Default => None*

- analyzer: Custom analyzer function to split the text. *Default => text_to_word_sequence*
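
The effect of lower, filters and split can be illustrated with a simplified, pure-Python sketch of the word-splitting step. The helper name `split_words` and the constant `DEFAULT_FILTERS` are hypothetical, and the code only approximates what the default analyzer does; it is not the Keras implementation.

```python
# Default filter characters as described above: punctuation, tab and newline,
# but not the apostrophe.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical helper: approximate the default word-splitting step."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator, then split on it.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Drop the empty strings left by consecutive separators.
    return [word for word in text.split(split) if word]


print(split_words("My name is sam!"))           # ['my', 'name', 'is', 'sam']
print(split_words("Don't stop", lower=False))   # ["Don't", 'stop']
```

Because the apostrophe is not in the filter string, "Don't" survives as a single token.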


**Methods:**

- fit_on_sequences(sequences): updates the internal vocabulary statistics based on a list of integer sequences. [*Required before using sequences_to_matrix() if fit_on_texts() was never called*]

- fit_on_texts(data): updates the internal vocabulary list based on the words in the data.

- texts_to_sequences(data): Transforms each text in texts to a sequence of integers.

- texts_to_matrix(textArray, mode='binary'): Convert a list of texts to a Numpy matrix. [*Argument mode =>["binary", "count", "tfidf", "freq"]* ]


- get_config(): Returns the tokenizer configuration as Python dictionary.
- to_json(): returns a JSON string containing the tokenizer configuration.
- sequences_to_matrix(sequences, mode='binary'): Converts a list of sequences into a Numpy matrix. [*Argument mode =>[ "binary", "count", "tfidf", "freq"]* ]
- sequences_to_texts(sequences): Converts a list of integer sequences back into a list of texts (strings).
- sequences_to_texts_generator(sequences): Generator version of sequences_to_texts; yields one text (string) per sequence.
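
Of the methods above, sequences_to_matrix is the least obvious, so here is a rough pure-Python sketch of what its 'binary', 'count' and 'freq' modes compute. The function name `sequences_to_matrix_sketch` is hypothetical, the 'tfidf' mode is left out, and the result is a plain list of lists rather than a Numpy matrix.

```python
from collections import Counter


def sequences_to_matrix_sketch(sequences, num_words, mode="binary"):
    """Hypothetical helper: one row per sequence, one column per word index."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        # Indices at or above num_words have no column and are ignored.
        counts = Counter(i for i in seq if i < num_words)
        for index, count in counts.items():
            if mode == "binary":
                row[index] = 1.0                # word is present
            elif mode == "count":
                row[index] = float(count)       # number of occurrences
            elif mode == "freq":
                row[index] = count / len(seq)   # share of the sequence
            else:
                raise ValueError(f"mode not covered by this sketch: {mode}")
    return matrix


print(sequences_to_matrix_sketch([[2, 3, 4, 5]], num_words=6))
# [[0.0, 0.0, 1.0, 1.0, 1.0, 1.0]]
print(sequences_to_matrix_sketch([[1, 1, 3]], num_words=6, mode="count"))
# [[0.0, 2.0, 0.0, 1.0, 0.0, 0.0]]
```

Column 0 is always empty because word indices start at 1.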


In [2]:
Sentences = [
             'My name is Sam',
             'My name is Bob',
             'My Name is Bob',
             'My name is sam!',
             'I like to play baseball',
             'I like go out for walking'
]
# tokenizer that keeps only the most frequent words when encoding (word indices below num_words=100)
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(Sentences)
word_index=tokenizer.word_index
print(word_index)

{'my': 1, 'name': 2, 'is': 3, 'sam': 4, 'bob': 5, 'i': 6, 'like': 7, 'to': 8, 'play': 9, 'baseball': 10, 'go': 11, 'out': 12, 'for': 13, 'walking': 14}


In [3]:
sequences = tokenizer.texts_to_sequences(Sentences)
print(sequences)

[[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 5], [1, 2, 3, 4], [6, 7, 8, 9, 10], [6, 7, 11, 12, 13, 14]]


When the test data contains words that were not seen while fitting the tokenizer, those words are silently dropped from the generated sequences.


In [4]:
test_sentence = ['Iam Bob', 'My name is Suman']
tSeq = tokenizer.texts_to_sequences(test_sentence)
print(tSeq)

[[5], [1, 2, 3]]


To avoid losing words from the sequence, we can pass the oov_token parameter to the Tokenizer. It acts as a placeholder for out-of-vocabulary items: every word that is not in the vocabulary is replaced with the index of this token.

In [5]:
tokenizer1 = Tokenizer(num_words=100,oov_token="<unknown>")
tokenizer1.fit_on_texts(Sentences)
seq1 = tokenizer1.texts_to_sequences(Sentences)
tSeq1 = tokenizer1.texts_to_sequences(test_sentence)
print(tokenizer1.word_index)
print(seq1)
print(tSeq1)

{'<unknown>': 1, 'my': 2, 'name': 3, 'is': 4, 'sam': 5, 'bob': 6, 'i': 7, 'like': 8, 'to': 9, 'play': 10, 'baseball': 11, 'go': 12, 'out': 13, 'for': 14, 'walking': 15}
[[2, 3, 4, 5], [2, 3, 4, 6], [2, 3, 4, 6], [2, 3, 4, 5], [7, 8, 9, 10, 11], [7, 8, 12, 13, 14, 15]]
[[1, 6], [2, 3, 4, 1]]
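
To see why unknown words vanish without oov_token and map to index 1 with it, the vocabulary logic can be approximated in plain Python. This is a simplified sketch, not the Keras source: the helper names `words`, `build_word_index` and `to_sequences` are hypothetical, and the word splitting handles only a few punctuation marks.

```python
from collections import Counter


def words(text):
    """Lowercase, drop a few punctuation marks, split on whitespace."""
    return text.lower().translate(str.maketrans("", "", "!?.,")).split()


def build_word_index(texts, oov_token=None):
    """Rank words by frequency; the OOV token, if any, takes index 1."""
    counts = Counter(w for text in texts for w in words(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]
    vocab = ([oov_token] if oov_token is not None else []) + ranked
    return {w: i for i, w in enumerate(vocab, start=1)}


def to_sequences(texts, word_index, oov_token=None):
    """Map words to indices; unknown words become the OOV index or are dropped."""
    oov_index = word_index.get(oov_token)  # None when no OOV token is set
    result = []
    for text in texts:
        seq = []
        for w in words(text):
            index = word_index.get(w, oov_index)
            if index is not None:
                seq.append(index)
        result.append(seq)
    return result


train = ["My name is Sam", "My name is Bob", "My Name is Bob", "My name is sam!",
         "I like to play baseball", "I like go out for walking"]
test = ["Iam Bob", "My name is Suman"]

plain = build_word_index(train)
with_oov = build_word_index(train, oov_token="<unknown>")
print(to_sequences(test, plain))                            # [[5], [1, 2, 3]]
print(to_sequences(test, with_oov, oov_token="<unknown>"))  # [[1, 6], [2, 3, 4, 1]]
```

Reserving index 1 for the OOV token is what shifts every other word index up by one.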


#### Padding:

When sentences contain different numbers of words, their sequences have different lengths and cannot be stacked into a regular 2D matrix the way fixed-size image data can.

To handle this situation, pad_sequences is used.

In [6]:
padded_seq1 = pad_sequences(seq1,padding='post')
print(padded_seq1)

[[ 2  3  4  5  0  0]
 [ 2  3  4  6  0  0]
 [ 2  3  4  6  0  0]
 [ 2  3  4  5  0  0]
 [ 7  8  9 10 11  0]
 [ 7  8 12 13 14 15]]


In [7]:
padded_seq2 = pad_sequences(seq1) # defaults to 'pre'
print(padded_seq2)

[[ 0  0  2  3  4  5]
 [ 0  0  2  3  4  6]
 [ 0  0  2  3  4  6]
 [ 0  0  2  3  4  5]
 [ 0  7  8  9 10 11]
 [ 7  8 12 13 14 15]]


In [8]:
padded_seq3 = pad_sequences(seq1,padding='pre')
print(padded_seq3)

[[ 0  0  2  3  4  5]
 [ 0  0  2  3  4  6]
 [ 0  0  2  3  4  6]
 [ 0  0  2  3  4  5]
 [ 0  7  8  9 10 11]
 [ 7  8 12 13 14 15]]


The padding value defaults to 0. It can be changed to a float or string value using the 'value' argument; the value must be representable in the output dtype, which is 'int32' by default.

In [9]:
padded_seq7 = pad_sequences(seq1,padding='post', value=-1)
print(padded_seq7)

[[ 2  3  4  5 -1 -1]
 [ 2  3  4  6 -1 -1]
 [ 2  3  4  6 -1 -1]
 [ 2  3  4  5 -1 -1]
 [ 7  8  9 10 11 -1]
 [ 7  8 12 13 14 15]]


To limit the padded length instead of using the length of the longest sentence, set the maxlen argument of the pad_sequences function. Sequences longer than maxlen are truncated to maxlen tokens, and shorter ones are padded up to maxlen.

In [10]:
padded_seq4 = pad_sequences(seq1,maxlen=4)
print(padded_seq4)

[[ 2  3  4  5]
 [ 2  3  4  6]
 [ 2  3  4  6]
 [ 2  3  4  5]
 [ 8  9 10 11]
 [12 13 14 15]]


In [11]:
padded_seq5= pad_sequences(seq1,maxlen=4, padding='post')
print(padded_seq5)

[[ 2  3  4  5]
 [ 2  3  4  6]
 [ 2  3  4  6]
 [ 2  3  4  5]
 [ 8  9 10 11]
 [12 13 14 15]]


In the cells above, truncation removes tokens from the start of the sequence regardless of the padding setting, because the 'truncating' argument defaults to 'pre'. To remove tokens from the end instead, set truncating='post'.

In [12]:
padded_seq6 = pad_sequences(seq1,maxlen=5, padding='post', truncating='post')
print(padded_seq6)

[[ 2  3  4  5  0]
 [ 2  3  4  6  0]
 [ 2  3  4  6  0]
 [ 2  3  4  5  0]
 [ 7  8  9 10 11]
 [ 7  8 12 13 14]]
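
The padding and truncation rules demonstrated in the cells above can be summarized in a short pure-Python sketch. The function name `pad_sequences_sketch` is hypothetical; it returns a list of lists rather than a Numpy array and is meant only to illustrate the rules, not to replace pad_sequences.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Hypothetical helper: pad or truncate every sequence to one length."""
    if maxlen is None:
        # Without maxlen, the longest sequence sets the target length.
        maxlen = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: 'pre' drops leading items, 'post' drops trailing ones.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        # Then fill the remaining positions with the padding value.
        fill = [value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


seqs = [[2, 3, 4, 5], [7, 8, 9, 10, 11], [7, 8, 12, 13, 14, 15]]
print(pad_sequences_sketch(seqs, maxlen=4))
# [[2, 3, 4, 5], [8, 9, 10, 11], [12, 13, 14, 15]]
print(pad_sequences_sketch(seqs, maxlen=5, padding="post", truncating="post"))
# [[2, 3, 4, 5, 0], [7, 8, 9, 10, 11], [7, 8, 12, 13, 14]]
```

Padding and truncation are independent settings, which is why padding='post' alone does not change which tokens are cut.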


**fit_on_sequences(sequences)**

In [20]:
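# Word lists like these are NOT valid input for fit_on_sequences,
# which expects lists of integer word indices (such as seq1).
sequences = [
             ['My', 'name', 'is','Sam'],
             ['My', 'name', 'is','Bob'],
             ['I', 'like', 'to','play','baseball'],
             ['I', 'like', 'icecream']
]
# Binary bag-of-words matrix: one row per sentence, one column per word index
seq2 = tokenizer1.sequences_to_matrix(seq1, mode='binary')
# fit_on_sequences updates the tokenizer's internal document counts in place and returns None
seq21 = tokenizer1.fit_on_sequences(seq1)
print(seq2)
print(seq21)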

[[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0.]
 [0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.