# Natural Language Processing

## What is NLP?
NLP (natural language processing) lets us work with text data, here in conjunction with deep learning.

The model built in this notebook generates new text, one character at a time, based on a corpus of text data.

Learn more: [NLP](https://en.wikipedia.org/wiki/Natural_language_processing)

## What are the steps needed to build an NLP model?
Let's say we want to use the works of William Shakespeare:
* Step 1: Read in the text data. Use Python to read the text in as a string. For realistic results the data set should contain at least about 1 million characters.
* Step 2: Text processing and vectorization. Encode the raw string as integers.
* Step 3: Create batches. Use TensorFlow's `Dataset` object to create batches of text sequences.
* Step 4: Create the model. We'll use three layers: Embedding, GRU, and Dense.
* Step 5: Train the model. Set up the batches and encode the character labels.
* Step 6: Generate new text.

# Step 1

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

In [6]:
path_to_file = 'shakespeare.txt'

In [7]:
# Use a context manager so the file is closed after reading
with open(path_to_file, 'r') as f:
  text = f.read()

In [8]:
print(text[:670])


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.


  


To get the unique characters:

In [9]:
vocab = sorted(set(text))
print(vocab)

['\n', ' ', '!', '"', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '}']


In [10]:
len(vocab)

84

# Step 2

In [11]:
char_to_ind = {char:ind for ind, char in enumerate(vocab)}
char_to_ind

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '&': 4,
 "'": 5,
 '(': 6,
 ')': 7,
 ',': 8,
 '-': 9,
 '.': 10,
 '0': 11,
 '1': 12,
 '2': 13,
 '3': 14,
 '4': 15,
 '5': 16,
 '6': 17,
 '7': 18,
 '8': 19,
 '9': 20,
 ':': 21,
 ';': 22,
 '<': 23,
 '>': 24,
 '?': 25,
 'A': 26,
 'B': 27,
 'C': 28,
 'D': 29,
 'E': 30,
 'F': 31,
 'G': 32,
 'H': 33,
 'I': 34,
 'J': 35,
 'K': 36,
 'L': 37,
 'M': 38,
 'N': 39,
 'O': 40,
 'P': 41,
 'Q': 42,
 'R': 43,
 'S': 44,
 'T': 45,
 'U': 46,
 'V': 47,
 'W': 48,
 'X': 49,
 'Y': 50,
 'Z': 51,
 '[': 52,
 ']': 53,
 '_': 54,
 '`': 55,
 'a': 56,
 'b': 57,
 'c': 58,
 'd': 59,
 'e': 60,
 'f': 61,
 'g': 62,
 'h': 63,
 'i': 64,
 'j': 65,
 'k': 66,
 'l': 67,
 'm': 68,
 'n': 69,
 'o': 70,
 'p': 71,
 'q': 72,
 'r': 73,
 's': 74,
 't': 75,
 'u': 76,
 'v': 77,
 'w': 78,
 'x': 79,
 'y': 80,
 'z': 81,
 '|': 82,
 '}': 83}

In [12]:
char_to_ind['H']

33

In [13]:
ind_to_char = np.array(vocab)
ind_to_char[33]

'H'

In [14]:
encoded_text = np.array([char_to_ind[c] for c in text])
encoded_text.shape

(5445609,)
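
The encoding step can be checked on a small string. The sketch below is an illustration only: it uses a short toy string in place of the Shakespeare corpus, and the `toy_`-prefixed names are hypothetical, not part of the notebook. It shows that mapping characters to indices and back is lossless.

```python
# Toy stand-in for the corpus (an assumption for illustration).
sample_text = "From fairest creatures"

# Sorted unique characters form the vocabulary, as in the cells above.
toy_vocab = sorted(set(sample_text))

# Character -> index lookup; the sorted list itself serves as index -> character.
toy_char_to_ind = {char: ind for ind, char in enumerate(toy_vocab)}
toy_ind_to_char = toy_vocab

# Encode the string as integers, then decode it back.
encoded = [toy_char_to_ind[c] for c in sample_text]
decoded = "".join(toy_ind_to_char[i] for i in encoded)

print(encoded)
print(decoded == sample_text)  # True: the round trip loses nothing
```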

In [15]:
encoded_text[:670]

array([ 0,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, 12,  0,  1,  1, 31, 73, 70, 68,  1, 61, 56, 64,
       73, 60, 74, 75,  1, 58, 73, 60, 56, 75, 76, 73, 60, 74,  1, 78, 60,
        1, 59, 60, 74, 64, 73, 60,  1, 64, 69, 58, 73, 60, 56, 74, 60,  8,
        0,  1,  1, 45, 63, 56, 75,  1, 75, 63, 60, 73, 60, 57, 80,  1, 57,
       60, 56, 76, 75, 80,  5, 74,  1, 73, 70, 74, 60,  1, 68, 64, 62, 63,
       75,  1, 69, 60, 77, 60, 73,  1, 59, 64, 60,  8,  0,  1,  1, 27, 76,
       75,  1, 56, 74,  1, 75, 63, 60,  1, 73, 64, 71, 60, 73,  1, 74, 63,
       70, 76, 67, 59,  1, 57, 80,  1, 75, 64, 68, 60,  1, 59, 60, 58, 60,
       56, 74, 60,  8,  0,  1,  1, 33, 64, 74,  1, 75, 60, 69, 59, 60, 73,
        1, 63, 60, 64, 73,  1, 68, 64, 62, 63, 75,  1, 57, 60, 56, 73,  1,
       63, 64, 74,  1, 68, 60, 68, 70, 73, 80, 21,  0,  1,  1, 27, 76, 75,
        1, 75, 63, 70, 76,  1, 58, 70, 69, 75, 73, 56, 58, 75, 60, 59,  1,
       75, 70,  1, 75, 63

# Step 3

In [16]:
line = 'From fairest creatures we desire increase'
len(line)

41

In [17]:
lines = '''
From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
'''
len(lines)

133

In [18]:
seq_len = 120

In [19]:
total_num_seq = len(text) // (seq_len+1)
total_num_seq

45005
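
The figure above follows from simple integer division. As a worked check, using only the numbers printed in this notebook: each training example needs `seq_len` input characters plus one extra character for the shifted target, so the text is cut into chunks of `seq_len + 1`. The name `chunk_len` is introduced here for illustration.

```python
num_chars = 5445609   # length of the encoded text, from the output above
seq_len = 120         # input characters per training sequence

# One extra character per chunk so that inputs and targets can be offset by one.
chunk_len = seq_len + 1
total_num_seq, leftover = divmod(num_chars, chunk_len)

print(total_num_seq)  # 45005
print(leftover)       # 4 characters do not fill a final chunk
```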

In [20]:
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

In [21]:
for item in char_dataset.take(500):
  print(ind_to_char[item.numpy()])



 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1


 
 
F
r
o
m
 
f
a
i
r
e
s
t
 
c
r
e
a
t
u
r
e
s
 
w
e
 
d
e
s
i
r
e
 
i
n
c
r
e
a
s
e
,


 
 
T
h
a
t
 
t
h
e
r
e
b
y
 
b
e
a
u
t
y
'
s
 
r
o
s
e
 
m
i
g
h
t
 
n
e
v
e
r
 
d
i
e
,


 
 
B
u
t
 
a
s
 
t
h
e
 
r
i
p
e
r
 
s
h
o
u
l
d
 
b
y
 
t
i
m
e
 
d
e
c
e
a
s
e
,


 
 
H
i
s
 
t
e
n
d
e
r
 
h
e
i
r
 
m
i
g
h
t
 
b
e
a
r
 
h
i
s
 
m
e
m
o
r
y
:


 
 
B
u
t
 
t
h
o
u
 
c
o
n
t
r
a
c
t
e
d
 
t
o
 
t
h
i
n
e
 
o
w
n
 
b
r
i
g
h
t
 
e
y
e
s
,


 
 
F
e
e
d
'
s
t
 
t
h
y
 
l
i
g
h
t
'
s
 
f
l
a
m
e
 
w
i
t
h
 
s
e
l
f
-
s
u
b
s
t
a
n
t
i
a
l
 
f
u
e
l
,


 
 
M
a
k
i
n
g
 
a
 
f
a
m
i
n
e
 
w
h
e
r
e
 
a
b
u
n
d
a
n
c
e
 
l
i
e
s
,


 
 
T
h
y
 
s
e
l
f
 
t
h
y
 
f
o
e
,
 
t
o
 
t
h
y
 
s
w
e
e
t
 
s
e
l
f
 
t
o
o
 
c
r
u
e
l
:


 
 
T
h
o
u
 
t
h
a
t
 
a
r
t
 
n
o
w
 
t
h
e
 
w
o
r
l
d
'
s
 
f
r
e
s
h
 
o
r
n
a
m
e
n
t
,


 
 
A
n
d
 
o
n
l
y
 
h
e
r
a
l
d
 
t
o
 
t
h
e
 
g
a
u
d
y
 
s
p
r
i
n
g
,


 
 
W
i
t
h
i
n
 
t
h
i
n
e
 
o
w
n
 
b
u


In [22]:
sequences = char_dataset.batch(seq_len+1, drop_remainder=True)
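
Each chunk has `seq_len + 1` characters because a next-character model is usually trained on an input and a target that are the same text offset by one position. A minimal sketch of that split on a plain string, with the hypothetical helper name `split_input_target`:

```python
def split_input_target(chunk):
    """Return (input, target), where target is the input shifted by one."""
    return chunk[:-1], chunk[1:]

# A chunk of seq_len + 1 = 5 characters, so seq_len = 4 here.
chunk = "Hello"
input_txt, target_txt = split_input_target(chunk)

# At every position the model sees one character and should predict the next.
for inp, tgt in zip(input_txt, target_txt):
    print(f"after {inp!r} predict {tgt!r}")
```

Both halves have length `seq_len`, which is why the chunk itself must be one character longer.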

In [23]:
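def create_seq_targets(seq):
  # Input: every character except the last
  input_txt = seq[:-1]
  # Target: the same sequence shifted one step ahead (the next character at each position)
  target_txt = seq[1:]
  return input_txt, target_txt

In [24]:
dataset = sequences.map(create_seq_targets)

In [25]:
for input_txt, target_txt in dataset.take(1):
  print(input_txt.numpy())
  print(''.join(ind_to_char[input_txt.numpy()]))
  print('\n')
  print(target_txt.numpy())
  print(''.join(ind_to_char[target_txt.numpy()]))

[ 0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 12  0
  1  1 31 73 70 68  1 61 56 64 73 60 74 75  1 58 73 60 56 75 76 73 60 74
  1 78 60  1 59 60 74 64 73 60  1 64 69 58 73 60 56 74 60  8  0  1  1 45
 63 56 75  1 75 63 60 73 60 57 80  1 57 60 56 76 75 80  5 74  1 73 70 74
 60  1 68 64 62 63 75  1 69 60 77 60 73  1 59 64 60  8  0  1  1 27 76 75]

                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But


[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 12  0  1
  1 31 73 70 68  1 61 56 64 73 60 74 75  1 58 73 60 56 75 76 73 60 74  1
 78 60  1 59 60 74 64 73 60  1 64 69 58 73 60 56 74 60  8  0  1  1 45 63
 56 75  1 75 63 60 73 60 57 80  1 57 60 56 76 75 80  5 74  1 73 70 74 60
  1 68 64 62 63 75  1 69 60 77 60 73  1 59 64 60  8  0  1  1 27 76 75  1]
                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But 

In [26]:
batch_size = 128

In [27]:
buffer_size = 10000

dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
dataset

<BatchDataset shapes: ((128, 120), (128, 120)), types: (tf.int64, tf.int64)>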
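
To make the shuffle-then-batch step concrete, here is a rough pure-Python sketch of the idea. It is not TensorFlow's implementation: `buffered_shuffle` and `batch` are hypothetical helpers that only mimic the documented behaviour of a fixed-size shuffle buffer and of batching with `drop_remainder=True`.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]             # emit a randomly chosen buffered item
        buffer[idx] = item            # and put the new item in its place
    rng.shuffle(buffer)               # drain whatever is left at the end
    yield from buffer

def batch(items, batch_size):
    """Group items into full batches, dropping an incomplete final batch."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []

batches = list(batch(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
print(batches)  # three batches of three; the tenth item is dropped
```

With the numbers in this notebook, 45005 sequences in batches of 128 give 351 full batches per epoch, and the remaining 77 sequences are dropped.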

# Step 4

In [28]:
vocab_size = len(vocab)
vocab_size

84

In [29]:
embed_dim = 64

In [30]:
rnn_neurons = 1026

In [31]:
from tensorflow.keras.losses import sparse_categorical_crossentropy

In [32]:
def sparse_cat_loss(y_true, y_pred):
  return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
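
`from_logits=True` is needed because the final Dense layer has no softmax, so the model outputs raw scores rather than probabilities. As an illustration of what the loss computes for a single prediction, here is a small pure-Python version. The function name is hypothetical, and this is a sketch of the formula, not the Keras code.

```python
import math

def sparse_cross_entropy_from_logits(true_index, logits):
    """Negative log-probability of the true class, computed from raw logits."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[true_index]

logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-character vocabulary
loss_likely = sparse_cross_entropy_from_logits(0, logits)    # true class has the top score
loss_unlikely = sparse_cross_entropy_from_logits(2, logits)  # true class has the lowest score
print(loss_likely, loss_unlikely)
```

The loss is small when the true character has the highest score and grows as its score falls behind the others.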

In [33]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

In [34]:
def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
  model = Sequential()

  model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None]))

  model.add(GRU(rnn_neurons, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'))

  model.add(Dense(vocab_size))

  model.compile(optimizer='adam', loss=sparse_cat_loss)

  return model

In [35]:
model = create_model(vocab_size=vocab_size, embed_dim=embed_dim, rnn_neurons=rnn_neurons, batch_size=batch_size)

In [36]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (128, None, 64)           5376      
_________________________________________________________________
gru (GRU)                    (128, None, 1026)         3361176   
_________________________________________________________________
dense (Dense)                (128, None, 84)           86268     
Total params: 3,452,820
Trainable params: 3,452,820
Non-trainable params: 0
_________________________________________________________________
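
The parameter counts in the summary can be reproduced by hand. The GRU line assumes two bias vectors per gate, which matches the Keras `reset_after=True` default in TensorFlow 2.x; the formula is offered as a check on the printed numbers, not as a description of the layer's internals.

```python
vocab_size, embed_dim, rnn_neurons = 84, 64, 1026

# Embedding: one embed_dim-sized vector per character.
embedding_params = vocab_size * embed_dim

# GRU: 3 gates, each with input weights, recurrent weights and two biases
# (assumes the reset_after=True variant).
gru_params = 3 * (rnn_neurons * (embed_dim + rnn_neurons) + 2 * rnn_neurons)

# Dense: weights from every GRU unit to every character, plus one bias each.
dense_params = rnn_neurons * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 5376 3361176 86268 3452820
```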


# Step 5

## GPU
This model has over 3 million parameters, and training an RNN is slow on a CPU, so we use Colab's GPU to train it:
* Go to Notebook Settings
* Select GPU as the hardware accelerator

In [37]:
for input_example_batch, target_example_batch in dataset.take(1):

  example_batch_predictions = model(input_example_batch)

In [38]:
example_batch_predictions.shape

TensorShape([128, 120, 84])

In [42]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)

In [44]:
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [45]:
ind_to_char[sampled_indices]

array(['[', '-', '3', 'r', 'V', ';', '&', '<', 'e', ';', 'K', 'T', '\n',
       'M', '!', "'", '9', 'h', 'S', '!', 'X', 'U', '"', 'p', 'S', '"',
       'K', 'w', 'K', 'H', 'J', ';', '-', '>', 'm', '3', 'Y', '.', ';',
       'Z', 'a', 'P', '2', 't', '2', 'p', 'B', '5', '-', ' ', 'r', 'D',
       'H', 'b', '(', 'e', 'q', '\n', 'j', 'Z', 'x', '_', '.', 'M', 'k',
       '[', '?', 'P', '-', 'y', '(', ',', 'f', 'S', '<', '_', 'k', '5',
       'H', 'A', 'Z', '.', 's', '5', '_', 'X', '0', '>', 'G', 'x', '_',
       'l', 'a', '\n', ';', 'S', 'Y', '6', 'Y', 'r', 'L', 'B', 'u', '\n',
       'H', 'I', 'y', 'V', 'j', 'M', '3', '1', 'W', '}', 'N', '(', 'K',
       'm', 'T', '7'], dtype='<U1')
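
The characters above look random because the model is still untrained and `tf.random.categorical` samples from whatever distribution the logits define. The sketch below imitates that sampling in plain Python and adds the temperature scaling that `generate_text` applies further down. `softmax` and `sample_index` are hypothetical helpers and the logits are made up.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index with probability given by the tempered softmax."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0.5))  # sharper: the top choice dominates
print(softmax(logits, temperature=2.0))  # flatter: closer to uniform

rng = random.Random(0)  # seeded only to make the sketch repeatable
print(sample_index(logits, temperature=1.0, rng=rng))
```

Lower temperatures make the generated text more predictable, while higher temperatures make it more varied.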

In [46]:
epochs = 30

In [54]:
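def generate_text(model, start_seed, gen_size=500, temp=1.0):
  # Number of characters to generate
  num_generate = gen_size

  # Vectorize the seed text and add a batch dimension
  input_eval = [char_to_ind[s] for s in start_seed]
  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

  # Lower temperatures give more predictable text, higher ones more varied text
  temperature = temp

  # Clear any GRU state left over from a previous call
  model.reset_states()

  for i in range(num_generate):
    predictions = model(input_eval)

    # Remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # Scale the logits by the temperature, then sample the next character
    predictions = predictions/temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

    # Feed the sampled character back in as the next input
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(ind_to_char[predicted_id])

  # Return only after the loop has generated every character
  return (start_seed+''.join(text_generated))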

In [55]:
print(generate_text(model, 'JULIET', gen_size=1000))

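ValueError: ignored

The call most likely fails because the model was built with a fixed batch size of 128 (`batch_input_shape=[batch_size, None]` together with `stateful=True`), while `generate_text` passes a single sequence. A common remedy is to rebuild the model with `batch_size=1` and load the trained weights into it before generating.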