# Word Encodings

**Parsing**: a formal analysis of a sentence into its constituents to produce a parse tree showing their syntactic relation to one another.

**Stemming**: the process of reducing words to their stems, that is, the part of a word that remains once its affixes are removed.

**Text segmentation**: the process of dividing text into meaningful units such as words, sentences, or topics.
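To make the stemming definition above concrete, here is a minimal sketch of suffix stripping. The suffix list and the `naive_stem` helper are hypothetical and chosen only for illustration; production stemmers such as the Porter stemmer apply far more rules than this.

```python
# Hypothetical suffix list for illustration only; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["raining", "rained", "rains", "rain"]]
print(stems)  # ['rain', 'rain', 'rain', 'rain']
```

The `min_stem_len` guard keeps short words such as "is" from being reduced to a single letter.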

#### Import libraries and APIs

In [4]:
## import the tensorflow APIs
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

#### Define training sentences

In [6]:
##sentences to tokenize
train_sentences1 = [
             'It is a sunny day',
             'It is a cloudy day',
]

#### Set up the tokenizer

In [8]:
##instantiate the tokenizer
tokenizer1 = Tokenizer(num_words=100)   # num_words caps the vocabulary used for encoding: only the most
                                        # frequent words (by count) are kept when texts become sequences

##train the tokenizer on training sentences
tokenizer1.fit_on_texts(train_sentences1) 

##store word index for the words in the sentence
word_index1 = tokenizer1.word_index

In [9]:
print(word_index1)

{'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6}
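The ordering in the word index above follows word frequency, with ties kept in first-seen order, and the indices start at 1. The sketch below reproduces that result in plain Python. It is an approximation for illustration, not the library's implementation; `FILTERS`, `split_words`, and `build_word_index` are names made up for this example.

```python
from collections import Counter

# Characters treated as separators; assumed to approximate the tokenizer's default filter.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase the text, replace filtered characters with spaces, and split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(sentences):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

sketch_index = build_word_index(['It is a sunny day', 'It is a cloudy day'])
print(sketch_index)
# {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6}
```

"day" lands ahead of "sunny" because it appears twice, while "sunny" and "cloudy" each appear once.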


## Creating sequences of tokens

#### Define training sentences in a list

In [12]:
# define list of sentences to tokenize
train_sentences2 = [
             'It is a sunny day',
             'It is a cloudy day',
             'Will it rain today?'
]

#### Train the tokenizer

In [14]:
##set up the tokenizer
tokenizer2 = Tokenizer(num_words=100)

##train the tokenizer on training sentences
tokenizer2.fit_on_texts(train_sentences2)

##store word index for the words in the sentence
word_index2 = tokenizer2.word_index

#### Create sequences

In [16]:
##create sequences using tokenizer
sequences = tokenizer2.texts_to_sequences(train_sentences2)

##print word index dictionary and sequences
print(f"Word index -->{word_index2}")
print(f"Sequences of words -->{sequences}")

Word index -->{'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}
Sequences of words -->[[1, 2, 3, 5, 4], [1, 2, 3, 6, 4], [7, 1, 8, 9]]


In [17]:
##print sample sentence and sequence
print(train_sentences2[0])
print(sequences[0])

It is a sunny day
[1, 2, 3, 5, 4]


In [18]:
print(train_sentences2[1])
print(sequences[1])

It is a cloudy day
[1, 2, 3, 6, 4]


In [19]:
print(train_sentences2[2])
print(sequences[2])

Will it rain today?
[7, 1, 8, 9]
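One way to sanity-check the sentence and sequence pairs above is to invert the word index and map each integer back to its word. The sketch below does this in plain Python with the word index copied from the earlier output; `index_word` and `decode` are illustrative names for this example.

```python
# Word index copied from the output above.
word_index = {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5,
              'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}

# Invert the mapping so each integer points back to its word.
index_word = {index: word for word, index in word_index.items()}

def decode(sequence):
    """Turn a list of token indices back into a lowercase sentence."""
    return " ".join(index_word[i] for i in sequence)

print(decode([1, 2, 3, 5, 4]))  # it is a sunny day
print(decode([7, 1, 8, 9]))     # will it rain today
```

The decoded text is lowercase and has no punctuation, because both were discarded during tokenization.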


#### Tokenizing new data using the same tokenizer

In [21]:
new_sentences = [
                 'Will it be raining today?',
                 'It is a pleasant day.'
]

new_sequences = tokenizer2.texts_to_sequences(new_sentences)

print(f"Word index -->{word_index2}")
print()
print(new_sentences)
print(new_sequences)

Word index -->{'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}

['Will it be raining today?', 'It is a pleasant day.']
[[7, 1, 9], [1, 2, 3, 4]]
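Notice that the new sequences are shorter than the sentences: "be", "raining", and "pleasant" were never seen during fitting, so they have no index and are silently dropped. A minimal plain-Python sketch of that lookup, using the word index printed above (`to_sequence` is a hypothetical helper, and the punctuation handling is simplified to the two characters that occur here):

```python
# Word index copied from the output above.
known_index = {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5,
               'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}

def to_sequence(sentence, word_index):
    """Look up each word; words missing from the index are skipped."""
    words = sentence.lower().replace("?", " ").replace(".", " ").split()
    return [word_index[w] for w in words if w in word_index]

print(to_sequence('Will it be raining today?', known_index))  # [7, 1, 9]
print(to_sequence('It is a pleasant day.', known_index))      # [1, 2, 3, 4]
```

Dropping words changes the sentence length and loses the fact that something was there at all.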


#### Replacing newly encountered words with special values

In [23]:
##set up the tokenizer again with oov_token
tokenizer3 = Tokenizer(num_words=100, oov_token = "<oov>")

##train the new tokenizer on training sentences
tokenizer3.fit_on_texts(train_sentences2)

##store word index for the words in the sentence
word_index3 = tokenizer3.word_index


##create sequences of the new sentences
new_sequences = tokenizer3.texts_to_sequences(new_sentences)

print(word_index3)
print(new_sentences)
print(new_sequences)

{'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6, 'cloudy': 7, 'will': 8, 'rain': 9, 'today': 10}
['Will it be raining today?', 'It is a pleasant day.']
[[8, 2, 1, 1, 10], [2, 3, 4, 1, 5]]
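With an OOV token, unseen words are replaced by index 1 instead of being dropped, so the sequence keeps the same length as the sentence. The sketch below mimics that behaviour in plain Python, using the word index printed above. The `encode` helper is hypothetical, and its `num_words` handling reflects my understanding of the tokenizer (indices at or above `num_words` also fall back to the OOV index), which is worth verifying against the library documentation.

```python
# Word index copied from the output above; '<oov>' takes index 1.
oov_word_index = {'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6,
                  'cloudy': 7, 'will': 8, 'rain': 9, 'today': 10}

def encode(sentence, word_index, oov_token='<oov>', num_words=None):
    """Encode a sentence, sending unknown or out-of-range words to the OOV index."""
    oov_index = word_index[oov_token]
    words = sentence.lower().replace("?", " ").replace(".", " ").split()
    sequence = []
    for word in words:
        index = word_index.get(word)
        # Assumption: words ranked at or beyond num_words are treated like unseen words.
        if index is None or (num_words is not None and index >= num_words):
            sequence.append(oov_index)
        else:
            sequence.append(index)
    return sequence

print(encode('Will it be raining today?', oov_word_index))       # [8, 2, 1, 1, 10]
print(encode('It is a pleasant day.', oov_word_index))           # [2, 3, 4, 1, 5]
print(encode('It is a sunny day', oov_word_index, num_words=6))  # [2, 3, 4, 1, 5]
```

In the last call, "sunny" has index 6, which is not below `num_words=6`, so it is encoded as the OOV index even though it was seen during fitting.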


## Padding the sequences

#### Import the APIs

In [26]:
# import the required APIs
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

#### Define the training sentences

In [28]:
train_sentences4 = [
             'It will rain',
             'The weather is cloudy!',
             'Will it be raining today?',
             'It is a super hot day!',
]

#### Train the tokenizer

In [30]:
# set up the tokenizer again with oov_token
tokenizer4 = Tokenizer(num_words=100, oov_token='<oov>')

# train the tokenizer on training sentences
tokenizer4.fit_on_texts(train_sentences4)

# store word index for the words in the sentence
word_index4 = tokenizer4.word_index

#### Create sequences

In [32]:
# create sequences
sequences4 = tokenizer4.texts_to_sequences(train_sentences4)

#### Pad sequences

In [34]:
# pad sequences
padded_seqs4 = pad_sequences(sequences4)


print(word_index4)
print()
print(train_sentences4)
print()
print(sequences4)
print()
print(padded_seqs4)

{'<oov>': 1, 'it': 2, 'will': 3, 'is': 4, 'rain': 5, 'the': 6, 'weather': 7, 'cloudy': 8, 'be': 9, 'raining': 10, 'today': 11, 'a': 12, 'super': 13, 'hot': 14, 'day': 15}

['It will rain', 'The weather is cloudy!', 'Will it be raining today?', 'It is a super hot day!']

[[2, 3, 5], [6, 7, 4, 8], [3, 2, 9, 10, 11], [2, 4, 12, 13, 14, 15]]

[[ 0  0  0  2  3  5]
 [ 0  0  6  7  4  8]
 [ 0  3  2  9 10 11]
 [ 2  4 12 13 14 15]]


#### Customising your padded sequence with parameters

In [36]:
## Pad sequences with parameters
padded_seqs = pad_sequences(
                            sequences4,         # Apply padding to the sequences
                            padding="post",     # Add padding to the end of each sequence
                            maxlen=5,           # Ensure each sequence has a maximum length of 5
                            truncating="post",  # Truncate sequences that are longer than maxlen from the end
                            )

# Output the padded sequences
print(padded_seqs)  

[[ 2  3  5  0  0]
 [ 6  7  4  8  0]
 [ 3  2  9 10 11]
 [ 2  4 12 13 14]]


In the padded output, every sequence now has length five. The last sequence was originally 2, 4, 12, 13, 14, 15; the trailing 15 has been dropped because we are using *post truncating*. The zeros appear at the end of the shorter sequences because we are using *post padding*.
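The padding and truncating rules can be sketched in a few lines of plain Python. This is an illustration of the logic rather than the library's code: the `pad` helper is hypothetical and returns nested lists instead of a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default target length: the longest sequence in the batch.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: drop items from the front ("pre") or the back ("post").
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: zeros go before ("pre") or after ("post") the tokens.
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded

seqs = [[2, 3, 5], [6, 7, 4, 8], [3, 2, 9, 10, 11], [2, 4, 12, 13, 14, 15]]

print(pad(seqs)[0])
# [0, 0, 0, 2, 3, 5]
print(pad(seqs, maxlen=5, padding="post", truncating="post")[3])
# [2, 4, 12, 13, 14]
print(pad(seqs, maxlen=5, padding="post", truncating="pre")[3])
# [4, 12, 13, 14, 15]
```

The last call shows the alternative: with `truncating="pre"` the leading 2 is dropped instead of the trailing 15.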

## Sentiment Analysis - Tokenizing news headlines for data preparation

Data preparation includes the following steps:

1. Download and read the data.
2. Segregate the headlines and their labels.
3. Tokenize the headlines.
4. Create sequences and add padding.

#### Download and read the data

In [41]:
import pandas as pd

# Read the Kaggle dataset from a local copy; lines=True parses one JSON object per line
data = pd.read_json('data/Sarcasm_Headlines_Dataset.json', lines=True)

data.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


#### Segregating the headlines

In [43]:
# create lists to store the headlines and labels
headlines = list(data['headline'])
labels = list(data['is_sarcastic'])

headlines[:5]  # First 5 lines

["former versace store clerk sues over secret 'black code' for minority shoppers",
 "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
 "mom starting to fear son's web series closest thing she will have to grandchild",
 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
 'j.k. rowling wishes snape happy birthday in the most magical way']
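For readers without pandas, the read and segregate steps can be sketched with the standard library. The file is in JSON Lines layout, one JSON object per line, which is what `lines=True` tells pandas to expect. The example below builds two in-memory records from headlines and labels shown above, keeping only the two fields used later; the `sample_` names are made up for this sketch.

```python
import io
import json

# Two records taken from the rows shown above (only the fields used later).
records = [
    {"headline": "j.k. rowling wishes snape happy birthday in the most magical way",
     "is_sarcastic": 0},
    {"headline": "boehner just wants wife to listen, not come up with alternative debt-reduction ideas",
     "is_sarcastic": 1},
]

# Serialise one JSON object per line to imitate the dataset's layout.
jsonl_text = "\n".join(json.dumps(record) for record in records)

# Read line by line and separate the headlines from their labels.
sample_headlines, sample_labels = [], []
for line in io.StringIO(jsonl_text):
    row = json.loads(line)
    sample_headlines.append(row["headline"])
    sample_labels.append(row["is_sarcastic"])

print(sample_labels)  # [0, 1]
```

Keeping the two lists in the same order matters, since position `i` in the labels must describe headline `i`.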

#### Tokenize the data

In [45]:
# import the required APIs
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [46]:
# set up the tokenizer
tokenizerc = Tokenizer(oov_token="<oov>")
tokenizerc.fit_on_texts(headlines)

word_indexc = tokenizerc.word_index

#print(word_indexc)

In [47]:
# Print the first 10 elements from word_indexc
first_10_items = list(word_indexc.items())[:10]
for word, index in first_10_items:
    print(f'Word: {word}, Index: {index}')

Word: <oov>, Index: 1
Word: to, Index: 2
Word: of, Index: 3
Word: the, Index: 4
Word: in, Index: 5
Word: for, Index: 6
Word: a, Index: 7
Word: on, Index: 8
Word: and, Index: 9
Word: with, Index: 10


#### Create padded sequences

In [49]:
# create sequences of the headlines
seqsc = tokenizerc.texts_to_sequences(headlines)

# post-pad sequences
padded_seqsc = pad_sequences(seqsc, padding="post")

# printing padded sequence of the first headline 
print(padded_seqsc[0])

[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
