
### Sentiment in Texts

In [1]:
import tensorflow
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

**Tokenization of sentences**

In [2]:
sentences = ['I Love my parents','I Love my country']

In [3]:
tokenizer = Tokenizer(num_words=100)

In [4]:
tokenizer.fit_on_texts(sentences)

In [5]:
word_index = tokenizer.word_index

In [6]:
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'parents': 4, 'country': 5}
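
To see where these indices come from, here is a rough pure-Python sketch of the ranking step: lowercase each sentence, split it on whitespace, count the words, and number them from 1 in order of decreasing frequency, with ties kept in order of first appearance. `build_word_index` is a hypothetical helper written for illustration, not the Keras implementation, and it ignores punctuation handling.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count (descending); ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['I Love my parents', 'I Love my country']
word_index = build_word_index(sentences)
print(word_index)
# {'i': 1, 'love': 2, 'my': 3, 'parents': 4, 'country': 5}
```

Index 0 is never assigned to a word, which leaves it free to be used as a padding value later.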


**Take a look at a new set of sentences**

In [7]:
sentences2 = ['I Love my parents','I love my country','i, Love your dog!']
tokenizer.fit_on_texts(sentences2)
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'parents': 4, 'country': 5, 'your': 6, 'dog': 7}
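
Note that `'i, Love your dog!'` contributed the plain words `i`, `love`, `your` and `dog`: case and punctuation do not create new entries. The sketch below imitates that cleanup step in plain Python. The `normalize` helper and the choice of separators (all ASCII punctuation except the apostrophe) are assumptions made for illustration, not the tokenizer's actual code.

```python
import string

# Assumption: every ASCII punctuation mark except the apostrophe is a separator.
SEPARATORS = string.punctuation.replace("'", "")
TABLE = str.maketrans(SEPARATORS, " " * len(SEPARATORS))

def normalize(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    return text.lower().translate(TABLE).split()

tokens = normalize('i, Love your dog!')
print(tokens)
# ['i', 'love', 'your', 'dog']

# Differently cased and punctuated variants collapse to the same tokens.
same = normalize('I Love my parents') == normalize('i love MY parents!')
print(same)
# True
```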


**Encode sentences as sequences of numbers**

In [8]:
sentences3 = ['I Love my parents',
              'I love my country',
              'i, Love your dog!',
              'Do you also like my dog?',
              'Do you think that my dog is amazing?']

In [9]:
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences3)

In [11]:
word_index = tokenizer.word_index
print(word_index)

{'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'do': 5, 'you': 6, 'parents': 7, 'country': 8, 'your': 9, 'also': 10, 'like': 11, 'think': 12, 'that': 13, 'is': 14, 'amazing': 15}


In [12]:
# Encode each sentence as a sequence of word indices
sequences = tokenizer.texts_to_sequences(sentences3)
print(sequences)

[[2, 3, 1, 7], [2, 3, 1, 8], [2, 3, 9, 4], [5, 6, 10, 11, 1, 4], [5, 6, 12, 13, 1, 4, 14, 15]]
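
Conceptually, the encoding step is a dictionary lookup per word. The following sketch reproduces it with the word index printed above. `to_sequence` is a hypothetical helper, and its cleanup rule (anything that is not a letter or digit becomes a space) is a simplification that happens to be enough for these sentences.

```python
# Word index copied from the output above.
word_index = {'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'do': 5, 'you': 6,
              'parents': 7, 'country': 8, 'your': 9, 'also': 10, 'like': 11,
              'think': 12, 'that': 13, 'is': 14, 'amazing': 15}

def to_sequence(text, word_index):
    """Map each known word to its index; unknown words are skipped."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in text.lower())
    return [word_index[w] for w in cleaned.split() if w in word_index]

encoded = to_sequence('Do you also like my dog?', word_index)
print(encoded)
# [5, 6, 10, 11, 1, 4]
```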


Let's try some test sentences. Words that the tokenizer has not seen are silently dropped from the sequences.

In [13]:
test_sentences = ['Hello! I am prem',
                  'I Love to my country',
                  'I do not have any dog!',
                  'but I love to keep dog in my house']

test_seq = tokenizer.texts_to_sequences(test_sentences)
print(test_seq)

[[2], [2, 3, 1, 8], [2, 5, 4], [2, 3, 4, 1]]
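
The shortened sequences do not show which words went missing. As a small diagnostic, the sketch below lists the words of each test sentence that are absent from the word index. `dropped_words` is a hypothetical helper with a simplified cleanup rule, not part of the Keras API.

```python
# Word index fitted on sentences3 (copied from the earlier output).
word_index = {'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'do': 5, 'you': 6,
              'parents': 7, 'country': 8, 'your': 9, 'also': 10, 'like': 11,
              'think': 12, 'that': 13, 'is': 14, 'amazing': 15}

def dropped_words(text, word_index):
    """Return the words of `text` that have no entry in `word_index`."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in text.lower())
    return [w for w in cleaned.split() if w not in word_index]

test_sentences = ['Hello! I am prem',
                  'I Love to my country',
                  'I do not have any dog!',
                  'but I love to keep dog in my house']

missing = [dropped_words(s, word_index) for s in test_sentences]
for sentence, words in zip(test_sentences, missing):
    print(sentence, '->', words)
```

The first sentence loses three of its four words, which is why its sequence is just `[2]`.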


**A closer look at the tokenizer**

Pass an `oov_token` so that unseen words are replaced by an `<OOV>` token instead of being dropped.

In [14]:
tokenizer = Tokenizer(num_words=100,oov_token="<OOV>")

sentences4 = ['I love my Dog',
              'I Love my country!',
              'Dog is a loyal animal',
              'large number of dog in our country']

tokenizer.fit_on_texts(sentences4)
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'dog': 2, 'i': 3, 'love': 4, 'my': 5, 'country': 6, 'is': 7, 'a': 8, 'loyal': 9, 'animal': 10, 'large': 11, 'number': 12, 'of': 13, 'in': 14, 'our': 15}


In [15]:
sequences = tokenizer.texts_to_sequences(sentences4)
print(sequences)

[[3, 4, 5, 2], [3, 4, 5, 6], [2, 7, 8, 9, 10], [11, 12, 13, 2, 14, 15, 6]]


In [16]:
test_sentences = ['Hello! I am prem',
                  'I Love to my country',
                  'I do not have any dog!',
                  'but I love to keep dog in my house']

test_seq = tokenizer.texts_to_sequences(test_sentences)
print(test_seq)

[[1, 3, 1, 1], [3, 4, 1, 5, 6], [3, 1, 1, 1, 1, 2], [1, 3, 4, 1, 1, 2, 14, 5, 1]]
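
With an out-of-vocabulary token, the lookup falls back to index 1 instead of skipping the word, so every sequence keeps the length of its sentence. A minimal sketch of that fallback, using the word index printed above (`to_sequence_with_oov` is a hypothetical helper with a simplified cleanup rule):

```python
# Word index fitted on sentences4 (copied from the earlier output).
word_index = {'<OOV>': 1, 'dog': 2, 'i': 3, 'love': 4, 'my': 5, 'country': 6,
              'is': 7, 'a': 8, 'loyal': 9, 'animal': 10, 'large': 11,
              'number': 12, 'of': 13, 'in': 14, 'our': 15}

def to_sequence_with_oov(text, word_index, oov_token='<OOV>'):
    """Map known words to their index and unknown words to the OOV index."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in text.lower())
    oov_id = word_index[oov_token]
    return [word_index.get(w, oov_id) for w in cleaned.split()]

sentence = 'but I love to keep dog in my house'
seq = to_sequence_with_oov(sentence, word_index)
print(seq)
# [1, 3, 4, 1, 1, 2, 14, 5, 1]
print(len(seq) == len(sentence.split()))
# True
```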


**Pad the sequences**

Once the tokenizer has created the sequences, we pass them to `pad_sequences` so that they all have the same length.

In [17]:
# import padding function
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [18]:
print(sequences)

[[3, 4, 5, 2], [3, 4, 5, 6], [2, 7, 8, 9, 10], [11, 12, 13, 2, 14, 15, 6]]


In [19]:
padded_seq = pad_sequences(sequences)
print(padded_seq)

[[ 0  0  0  3  4  5  2]
 [ 0  0  0  3  4  5  6]
 [ 0  0  2  7  8  9 10]
 [11 12 13  2 14 15  6]]
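
The default behaviour seen here (zeros added in front, up to the length of the longest sequence) can be sketched in a few lines of plain Python. `pad_pre` is a hypothetical helper that returns nested lists rather than a NumPy array.

```python
sequences = [[3, 4, 5, 2], [3, 4, 5, 6], [2, 7, 8, 9, 10],
             [11, 12, 13, 2, 14, 15, 6]]

def pad_pre(seqs, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    width = max(len(s) for s in seqs)
    return [[value] * (width - len(s)) + list(s) for s in seqs]

padded = pad_pre(sequences)
for row in padded:
    print(row)
# [0, 0, 0, 3, 4, 5, 2]
# [0, 0, 0, 3, 4, 5, 6]
# [0, 0, 2, 7, 8, 9, 10]
# [11, 12, 13, 2, 14, 15, 6]
```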


To add the padding after the actual sequence instead of before it, pass padding='post' to the pad_sequences function.

In [20]:
padded_seq_after = pad_sequences(sequences=sequences,padding='post')
print(padded_seq_after)

[[ 3  4  5  2  0  0  0]
 [ 3  4  5  6  0  0  0]
 [ 2  7  8  9 10  0  0]
 [11 12 13  2 14 15  6]]


We can also cap the length of the padded sequences by setting the maxlen parameter of the pad_sequences function. If a sentence is longer than the maximum length, words are dropped from the beginning of its sequence by default.

In [21]:
padded_seq_after_maxlen = pad_sequences(sequences,maxlen=5,padding='post')
print(padded_seq_after_maxlen)

[[ 3  4  5  2  0]
 [ 3  4  5  6  0]
 [ 2  7  8  9 10]
 [13  2 14 15  6]]


In [22]:
padded_seq_before_maxlen = pad_sequences(sequences,maxlen=5,padding='pre')
print(padded_seq_before_maxlen)

[[ 0  3  4  5  2]
 [ 0  3  4  5  6]
 [ 2  7  8  9 10]
 [13  2 14 15  6]]


If we set the truncating parameter to 'post', we lose information from the end of the sequence instead.

In [23]:
padded_seq_endloss = pad_sequences(sequences,maxlen=5,padding='post',truncating='post')
print(padded_seq_endloss)

[[ 3  4  5  2  0]
 [ 3  4  5  6  0]
 [ 2  7  8  9 10]
 [11 12 13  2 14]]
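
The interaction of `maxlen`, `padding` and `truncating` is easier to see in a small re-implementation. The `pad` function below is an illustrative sketch, not Keras code: it assumes `maxlen` is at least 1 and returns nested lists instead of a NumPy array.

```python
def pad(seqs, maxlen=None, padding='pre', truncating='pre', value=0):
    """Truncate, then pad, each sequence to exactly `maxlen` items."""
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        # 'pre' truncation keeps the last maxlen items, 'post' the first.
        s = list(s[-maxlen:] if truncating == 'pre' else s[:maxlen])
        fill = [value] * (maxlen - len(s))
        out.append(fill + s if padding == 'pre' else s + fill)
    return out

sequences = [[3, 4, 5, 2], [3, 4, 5, 6], [2, 7, 8, 9, 10],
             [11, 12, 13, 2, 14, 15, 6]]

start_loss = pad(sequences, maxlen=5, padding='post')
end_loss = pad(sequences, maxlen=5, padding='post', truncating='post')
print(start_loss[-1])  # [13, 2, 14, 15, 6]
print(end_loss[-1])    # [11, 12, 13, 2, 14]
```

Only the last sequence is longer than five items, so it is the only row affected by the `truncating` setting.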


**Padding on Seen Texts**

In [24]:
corpus = ['My country is india',
          'It lies in the continent of Asia.',
          'India is a beautiful country.',
          'Capital of India is New Delhi.',
          'The national language of India is Hindi.']

tokenizer = Tokenizer(num_words=100,oov_token="<OOV>")
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index

corpus_seq = tokenizer.texts_to_sequences(corpus)

padded_corpus = pad_sequences(corpus_seq)

print("Corpus words index: \n",word_index)
print("\nSequences of corpus: \n",corpus_seq)
print("\nPadded Sequences: \n",padded_corpus)

Corpus words index: 
 {'<OOV>': 1, 'is': 2, 'india': 3, 'of': 4, 'country': 5, 'the': 6, 'my': 7, 'it': 8, 'lies': 9, 'in': 10, 'continent': 11, 'asia': 12, 'a': 13, 'beautiful': 14, 'capital': 15, 'new': 16, 'delhi': 17, 'national': 18, 'language': 19, 'hindi': 20}

Sequences of corpus: 
 [[7, 5, 2, 3], [8, 9, 10, 6, 11, 4, 12], [3, 2, 13, 14, 5], [15, 4, 3, 2, 16, 17], [6, 18, 19, 4, 3, 2, 20]]

Padded Sequences: 
 [[ 0  0  0  7  5  2  3]
 [ 8  9 10  6 11  4 12]
 [ 0  0  3  2 13 14  5]
 [ 0 15  4  3  2 16 17]
 [ 6 18 19  4  3  2 20]]
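
As a sanity check, a padded row can be turned back into words by inverting the word index and skipping the zeros, which stand for padding rather than any word. `decode` is a hypothetical helper written for this sketch.

```python
# Word index copied from the output above.
word_index = {'<OOV>': 1, 'is': 2, 'india': 3, 'of': 4, 'country': 5,
              'the': 6, 'my': 7, 'it': 8, 'lies': 9, 'in': 10,
              'continent': 11, 'asia': 12, 'a': 13, 'beautiful': 14,
              'capital': 15, 'new': 16, 'delhi': 17, 'national': 18,
              'language': 19, 'hindi': 20}

# Invert the mapping: index -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(row, pad_value=0):
    """Join the words of a padded row, ignoring padding entries."""
    return ' '.join(index_word[i] for i in row if i != pad_value)

first = decode([0, 0, 0, 7, 5, 2, 3])
fourth = decode([0, 15, 4, 3, 2, 16, 17])
print(first)   # my country is india
print(fourth)  # capital of india is new delhi
```

The decoded text is lowercased and has no punctuation, because that information is discarded during tokenization.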


**Padding on Unseen Texts**

In [25]:
test_corpus = ['I love my country!',
               'India is a famous country all over the world.',
               'India is a democratic country.',
               'Peacocks look beautiful in their colorful feathers.',
               'Peacock is the most beautiful creatures of the earth']

test_seq = tokenizer.texts_to_sequences(test_corpus)
print("Test Sequences: \n",test_seq)

pad_seq = pad_sequences(test_seq,maxlen=5)
print("Padded Test Sequences: \n",pad_seq)

Test Sequences: 
 [[1, 1, 7, 5], [3, 2, 13, 1, 5, 1, 1, 6, 1], [3, 2, 13, 1, 5], [1, 1, 14, 10, 1, 1, 1], [1, 2, 6, 1, 14, 1, 4, 6, 1]]
Padded Test Sequences: 
 [[ 0  1  1  7  5]
 [ 5  1  1  6  1]
 [ 3  2 13  1  5]
 [14 10  1  1  1]
 [14  1  4  6  1]]
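
Many entries in these test sequences are 1, the `<OOV>` index. One way to quantify how well the fitted vocabulary covers unseen text is the share of OOV tokens per sentence, sketched below on the test sequences printed above. `oov_rate` is a hypothetical helper, and the OOV index of 1 is taken from the word index shown earlier.

```python
# Test sequences copied from the output above.
test_seq = [[1, 1, 7, 5],
            [3, 2, 13, 1, 5, 1, 1, 6, 1],
            [3, 2, 13, 1, 5],
            [1, 1, 14, 10, 1, 1, 1],
            [1, 2, 6, 1, 14, 1, 4, 6, 1]]
OOV_ID = 1  # index of '<OOV>' in the word index

def oov_rate(seq, oov_id=OOV_ID):
    """Fraction of tokens in `seq` that map to the OOV index."""
    if not seq:
        return 0.0
    return sum(1 for i in seq if i == oov_id) / len(seq)

rates = [oov_rate(s) for s in test_seq]
overall = oov_rate([i for s in test_seq for i in s])

for r in rates:
    print(f"{r:.2f}")
print(f"overall: {overall:.2f}")  # overall: 0.47
```

Nearly half of the test tokens are unknown here, which reflects how small the training corpus is.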
