# Natural language processing
---

In computer science, we refer to human languages, like English or Mandarin, as
“natural” languages, to distinguish them from languages that were designed for
machines, like Assembly, LISP, or XML. Every machine language was designed: its
starting point was a human engineer writing down a set of formal rules to describe what statements you could make in that language and what they meant. 

Rules came first, and people only started using the language once the rule set was complete. With human language, it’s the reverse: usage comes first, rules arise later.

Creating algorithms that can make sense of natural language is a big deal. The internet is mostly text. Language is how we store almost all of our knowledge. Our very thoughts are largely built upon language. However, the ability to understand natural language has long eluded machines. 

Modern NLP is about using machine learning and large datasets to
give computers the ability not to understand language, but
to ingest a piece of language as input and return something useful, like predicting the following:

- “What’s the topic of this text?” (text classification)
- “Does this text contain abuse?” (content filtering)
- “Does this text sound positive or negative?” (sentiment analysis)
- “What should be the next word in this incomplete sentence?” (language modeling)
- “How would you say this in German?” (translation)
- “How would you summarize this article in one paragraph?” (summarization)

### Preparing text data

Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. Vectorizing text is the process of transforming text into numeric tensors, and it typically follows three steps:

- First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
- You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
- You convert each such token into a numerical vector. This will usually involve
first indexing all tokens present in the data.
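
The three steps above can be sketched on a toy sentence. This is a minimal illustration that assumes word-level tokens split on whitespace; the helper names `standardize`, `tokenize`, and `build_index` are hypothetical and do not appear elsewhere in this notebook.

```python
from string import punctuation


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", punctuation))


def tokenize(text):
    """Split on whitespace to obtain word-level tokens."""
    return text.split()


def build_index(tokens):
    """Map each distinct token to an integer, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


sample = "The cat sat on the mat. The mat was flat!"
tokens = tokenize(standardize(sample))
index = build_index(tokens)
encoded = [index[tok] for tok in tokens]

print(tokens)
print(index)
print(encoded)
```

The rest of this notebook uses characters rather than words as tokens, but the same standardize, tokenize, index pipeline applies.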

We are going to process the book Frankenstein, which is available from Project Gutenberg. [Here is a list of the most popular books on the site.](https://www.gutenberg.org/ebooks/search/?sort_order=downloads)



In [1]:
!wget https://www.gutenberg.org/files/84/84-0.txt
!mv 84-0.txt frankenstein.txt

--2022-07-25 16:26:34--  https://www.gutenberg.org/files/84/84-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 448821 (438K) [text/plain]
Saving to: ‘84-0.txt’


2022-07-25 16:26:34 (5.82 MB/s) - ‘84-0.txt’ saved [448821/448821]



Project Gutenberg adds a standard header and footer to each book; neither is part of the original text. We have to identify the header and footer and remove them before processing the text.

In [2]:
!head -30 frankenstein.txt
!echo "-----------------"
!tail -355 frankenstein.txt | head -10

﻿The Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft (Godwin) Shelley

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Frankenstein
       or, The Modern Prometheus

Author: Mary Wollstonecraft (Godwin) Shelley

Release Date: 31, 1993 [eBook #84]
[Most recently updated: November 13, 2020]

Language: English

Character set encoding: UTF-8

Produced by: Judith Boss, Christy Phillips, Lynn Hanninen, and David Meltzer. HTML version by Al Haines.
Further corrections by Menno de Leeuw.

*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***




---

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random

from string import punctuation

from tensorflow.keras import models, layers, optimizers
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

In [4]:
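# Read the file as UTF-8, the encoding declared in the Gutenberg header.
with open('frankenstein.txt', encoding='utf-8') as f:
    lines = f.readlines()

# Drop the Project Gutenberg header (first 30 lines) and footer (last 355 lines).
lines = lines[30:-355]

# Standardize the text: convert it to lower case and remove punctuation.
# string.punctuation only covers ASCII, so typographic quotes and dashes remain.
raw_text = ''.join(lines).lower()
raw_text = raw_text.translate(str.maketrans("", "", punctuation))

print(raw_text[:5000])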

frankenstein

or the modern prometheus

by mary wollstonecraft godwin shelley


 contents

 letter 1
 letter 2
 letter 3
 letter 4
 chapter 1
 chapter 2
 chapter 3
 chapter 4
 chapter 5
 chapter 6
 chapter 7
 chapter 8
 chapter 9
 chapter 10
 chapter 11
 chapter 12
 chapter 13
 chapter 14
 chapter 15
 chapter 16
 chapter 17
 chapter 18
 chapter 19
 chapter 20
 chapter 21
 chapter 22
 chapter 23
 chapter 24




letter 1

to mrs saville england


st petersburgh dec 11th 17—


you will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings i arrived here yesterday and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking

i am already far north of london and as i walk in the streets of
petersburgh i feel a cold northern breeze play upon my cheeks which
braces my nerves and fills me with delight do you understand this
feeling this breeze which has trav
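
The slice `lines[30:-355]` relies on line counts that are specific to this version of the file. A possible alternative, sketched below, is to cut at the marker lines instead. The `*** START OF` marker is visible in the header output above; the matching `*** END OF` marker is an assumption, since the footer is not shown in this notebook. The function name `strip_gutenberg_boilerplate` and the sample lines are hypothetical.

```python
def strip_gutenberg_boilerplate(lines):
    """Return the lines strictly between the START and END marker lines."""
    # next() raises StopIteration if a marker is missing from the file.
    start = next(i for i, line in enumerate(lines) if line.startswith("*** START OF"))
    # Assumption: the footer opens with a line starting with "*** END OF".
    end = next(i for i, line in enumerate(lines) if line.startswith("*** END OF"))
    return lines[start + 1:end]


# Tiny made-up file used only to illustrate the idea.
sample_lines = [
    "The Project Gutenberg eBook of Example\n",
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n",
    "Body line one.\n",
    "Body line two.\n",
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n",
    "License text follows here.\n",
]

body = strip_gutenberg_boilerplate(sample_lines)
print(body)
```

On the real file this would keep the blank lines that directly follow the START marker, which the fixed offsets skip.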

With the raw text lowercased and stripped of punctuation, we can prepare the data for the neural network. The network cannot consume characters directly, so we first convert each character to an integer.

In [6]:
# Build lookup tables between characters and integer indices
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

print(char_to_int)
print(int_to_char)

n_vocab = len(char_to_int)
n_chars = len(raw_text)

print(f"Number of distinct characters: {n_vocab}\nTotal number of characters: {n_chars}")

{'\n': 0, ' ': 1, '0': 2, '1': 3, '2': 4, '3': 5, '4': 6, '5': 7, '6': 8, '7': 9, '8': 10, '9': 11, 'a': 12, 'b': 13, 'c': 14, 'd': 15, 'e': 16, 'f': 17, 'g': 18, 'h': 19, 'i': 20, 'j': 21, 'k': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'q': 28, 'r': 29, 's': 30, 't': 31, 'u': 32, 'v': 33, 'w': 34, 'x': 35, 'y': 36, 'z': 37, 'æ': 38, 'è': 39, 'é': 40, 'ê': 41, 'ô': 42, '—': 43, '‘': 44, '’': 45, '“': 46, '”': 47}
{0: '\n', 1: ' ', 2: '0', 3: '1', 4: '2', 5: '3', 6: '4', 7: '5', 8: '6', 9: '7', 10: '8', 11: '9', 12: 'a', 13: 'b', 14: 'c', 15: 'd', 16: 'e', 17: 'f', 18: 'g', 19: 'h', 20: 'i', 21: 'j', 22: 'k', 23: 'l', 24: 'm', 25: 'n', 26: 'o', 27: 'p', 28: 'q', 29: 'r', 30: 's', 31: 't', 32: 'u', 33: 'v', 34: 'w', 35: 'x', 36: 'y', 37: 'z', 38: 'æ', 39: 'è', 40: 'é', 41: 'ê', 42: 'ô', 43: '—', 44: '‘', 45: '’', 46: '“', 47: '”'}
Number of distinct characters: 48
Total number of characters: 409737
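
To see how the two dictionaries work together, here is a small round-trip sketch on a toy string. The `toy_` names and the `encode`/`decode` helpers are hypothetical and are kept separate from the notebook's own `char_to_int` and `int_to_char`.

```python
toy_text = "hello world"

# Same construction as above, applied to a much smaller vocabulary.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}


def encode(s):
    """Turn a string into a list of integer indices."""
    return [toy_char_to_int[c] for c in s]


def decode(ids):
    """Turn a list of integer indices back into a string."""
    return "".join(toy_int_to_char[i] for i in ids)


ids = encode("hello")
print(toy_char_to_int)
print(ids)
print(decode(ids))
```

Encoding followed by decoding returns the original string, which is what later lets us turn the model's predicted indices back into readable text.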


Next, we create the sequences that will be fed to the neural network, together with their "labels". Each input is a window of 100 consecutive characters, and its label is the character that immediately follows that window.

In [7]:
seq_length = 100

dataX = []
dataY = []

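# Slide a window of seq_length characters over the text, one character at a time
for i in range(n_chars - seq_length):
    seq_in = raw_text[i:i + seq_length]   # input window
    seq_out = raw_text[i + seq_length]    # character that follows the window
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
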
n_patterns = len(dataX)

In [9]:
print(dataX[0])
print(dataX[1])
print(dataY[0])
print(f"Total patterns: {n_patterns}")

[17, 29, 12, 25, 22, 16, 25, 30, 31, 16, 20, 25, 0, 0, 26, 29, 1, 31, 19, 16, 1, 24, 26, 15, 16, 29, 25, 1, 27, 29, 26, 24, 16, 31, 19, 16, 32, 30, 0, 0, 13, 36, 1, 24, 12, 29, 36, 1, 34, 26, 23, 23, 30, 31, 26, 25, 16, 14, 29, 12, 17, 31, 1, 18, 26, 15, 34, 20, 25, 1, 30, 19, 16, 23, 23, 16, 36, 0, 0, 0, 1, 14, 26, 25, 31, 16, 25, 31, 30, 0, 0, 1, 23, 16, 31, 31, 16, 29, 1, 3]
[29, 12, 25, 22, 16, 25, 30, 31, 16, 20, 25, 0, 0, 26, 29, 1, 31, 19, 16, 1, 24, 26, 15, 16, 29, 25, 1, 27, 29, 26, 24, 16, 31, 19, 16, 32, 30, 0, 0, 13, 36, 1, 24, 12, 29, 36, 1, 34, 26, 23, 23, 30, 31, 26, 25, 16, 14, 29, 12, 17, 31, 1, 18, 26, 15, 34, 20, 25, 1, 30, 19, 16, 23, 23, 16, 36, 0, 0, 0, 1, 14, 26, 25, 31, 16, 25, 31, 30, 0, 0, 1, 23, 16, 31, 31, 16, 29, 1, 3, 0]
0
Total patterns: 409637
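
The windowing is easier to follow on a very short string. The sketch below applies the same loop to a toy input with a window of 3 characters; `make_windows` is a hypothetical helper, not part of the notebook.

```python
def make_windows(text, seq_length):
    """Return (inputs, targets): each input is a window of seq_length
    characters and each target is the character that follows it."""
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        inputs.append(text[i:i + seq_length])
        targets.append(text[i + seq_length])
    return inputs, targets


windows, next_chars = make_windows("abcdef", 3)
for window, target in zip(windows, next_chars):
    print(repr(window), "->", repr(target))
```

A text of n characters yields n - seq_length patterns, which matches the count above: 409737 - 100 = 409637.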


Once we have the patterns, we can convert them to NumPy arrays and apply a simple normalization: dividing every character index by the vocabulary size scales the inputs into the range [0, 1).

In [11]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = to_categorical(dataY)

In [14]:
print(X.shape, y.shape)
print(y[0])

(409637, 100, 1) (409637, 48)
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
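
The same reshaping, scaling, and one-hot encoding can be checked on a toy dataset with plain NumPy. In this sketch `np.eye` is used as a stand-in for `to_categorical`, and all `toy_` names are hypothetical.

```python
import numpy as np

toy_vocab = 4                   # pretend the vocabulary has 4 characters
toy_X = [[0, 1, 2], [1, 2, 3]]  # two patterns of length 3
toy_y = [3, 0]                  # index of the next character for each pattern

# Reshape to [samples, time steps, features] and scale into [0, 1).
X_toy = np.reshape(toy_X, (len(toy_X), 3, 1)) / float(toy_vocab)

# One-hot encode the labels: row i of the identity matrix encodes class i.
y_toy = np.eye(toy_vocab)[toy_y]

print(X_toy.shape, y_toy.shape)
print(X_toy[:, :, 0])
print(y_toy)
```

Note that `to_categorical(dataY)` infers the number of classes from the largest label present; passing `num_classes=n_vocab` would make the output width explicit.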


In [17]:
model = models.Sequential()
# The first LSTM returns the full sequence so that the second LSTM can consume it
model.add(layers.LSTM(256, input_shape=(seq_length, 1), return_sequences=True))
model.add(layers.Dropout(0.2))
model.add(layers.LSTM(256))
model.add(layers.Dropout(0.2))
# One output unit per character in the vocabulary
model.add(layers.Dense(n_vocab, activation='softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer=optimizers.RMSprop(learning_rate=0.001),
    metrics=['accuracy'],
)

model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_3 (LSTM)               (None, 100, 256)          264192    
                                                                 
 dropout_3 (Dropout)         (None, 100, 256)          0         
                                                                 
 lstm_4 (LSTM)               (None, 256)               525312    
                                                                 
 dropout_4 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 48)                12336     
                                                                 
Total params: 801,840
Trainable params: 801,840
Non-trainable params: 0
_________________________________________________________________
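
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an LSTM layer (four weight sets: the input, forget, and output gates plus the candidate cell state) and for a dense layer; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Each of the 4 weight sets has input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """One weight per input-output pair, plus one bias per output unit."""
    return input_dim * units + units


counts = [
    lstm_params(1, 256),     # first LSTM: 1 feature per time step
    lstm_params(256, 256),   # second LSTM: fed by the 256 outputs of the first
    dense_params(256, 48),   # softmax layer over the 48 characters
]
print(counts, sum(counts))
```

Dropout layers have no weights, which is why they show 0 parameters in the summary.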


In [18]:
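# Save the model whenever the training loss improves
checkpoint = ModelCheckpoint('frankenstein_best_weights.h5', monitor='loss', verbose=1, save_best_only=True)
# X and y are in-memory NumPy arrays, so the generator-only workers argument is omitted
model.fit(X, y, epochs=20, batch_size=256, callbacks=[checkpoint])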

Epoch 1/20
Epoch 1: loss improved from inf to 2.66376, saving model to frankenstein_best_weights.h5
Epoch 2/20
Epoch 2: loss improved from 2.66376 to 2.32009, saving model to frankenstein_best_weights.h5
Epoch 3/20
Epoch 3: loss improved from 2.32009 to 2.14119, saving model to frankenstein_best_weights.h5
Epoch 4/20
Epoch 4: loss improved from 2.14119 to 2.02629, saving model to frankenstein_best_weights.h5
Epoch 5/20
Epoch 5: loss improved from 2.02629 to 1.94381, saving model to frankenstein_best_weights.h5
Epoch 6/20
Epoch 6: loss improved from 1.94381 to 1.87877, saving model to frankenstein_best_weights.h5
Epoch 7/20
Epoch 7: loss improved from 1.87877 to 1.82687, saving model to frankenstein_best_weights.h5
Epoch 8/20
Epoch 8: loss improved from 1.82687 to 1.78550, saving model to frankenstein_best_weights.h5
Epoch 9/20
Epoch 9: loss improved from 1.78550 to 1.74832, saving model to frankenstein_best_weights.h5
Epoch 10/20
Epoch 10: loss improved from 1.74832 to 1.71743, saving 

<keras.callbacks.History at 0x7f6320270110>

In [20]:
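# Pick a random starting pattern to seed the generator
start = np.random.randint(0, len(dataX))
# Copy the pattern so that appending to it does not modify dataX
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

result_string = ""

# Generate 100 characters, one at a time
for i in range(100):
    # Shape and scale the pattern the same way as the training data
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # Greedy decoding: take the most probable next character
    index = int(np.argmax(prediction))
    result_string += int_to_char[index]
    # Slide the window: append the new character and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:]
print("\nDone.")

print(result_string)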

Seed:
" air and revenge withdrew

i left the room and locking the door made a solemn vow in my own
heart nev "

Done.
er were the soutow of the soot which i had been the soutow of the soirits of the soot which i had be
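
The generated text repeats itself because the loop always takes the single most probable character (`np.argmax`). One common alternative, which is not part of the original notebook, is to sample the next character from the predicted distribution after rescaling it with a temperature. The sketch below is a minimal version under that assumption; `sample_with_temperature` and the toy probabilities are hypothetical.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from probs after rescaling by temperature.
    Lower temperatures approach argmax; higher ones add randomness."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


rng = np.random.default_rng(0)
toy_probs = [0.5, 0.3, 0.2]
draws = [sample_with_temperature(toy_probs, 0.8, rng) for _ in range(1000)]
print(np.bincount(draws, minlength=3))
```

In the generation loop, this would replace the `np.argmax(prediction)` call with something like `sample_with_temperature(prediction[0], 0.8)`; the temperature value is a tuning choice, not a recommendation from the source.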
