In [1]:
# Requirements: scikit-learn, nltk, gensim, tensorflow, tqdm, matplotlib
# ! pip install scikit-learn
# ! pip install nltk
# ! pip install gensim
# ! pip install tensorflow
# ! pip install tqdm
# ! pip install matplotlib

# A Super High-Level Tutorial on Some Basics of Natural Language Processing

No, despite this NLP tutorial involving neural networks, language, and teaching you programming, NLP does not stand for neuro-linguistic programming here. (However, it could stand for that if you're willing to step into such a future...)

#### What is meant by "natural language"?
Well, it's something with a pretty good Wikipedia article: https://en.wikipedia.org/wiki/Natural_language
Importantly, it's a language that evolved through everyday use "without conscious planning or premeditation". This is in contrast to a formal language, which is deliberately designed and is the kind of language you'd expect a computer to be able to understand.

NLP is all about using formal languages to create representations of natural languages.

#### NLP Jargon Dictionary

This is definitely not comprehensive, but it covers all of the jargon in this notebook.

- **Token** >> The smallest unit of natural-language information available to a machine. For example, the strings " wildcat " or " 's ". You might have "I like wildcats" as a token, but you should probably split it up into three tokens.
- **Document** >> A single collection of tokens that go together. A sentence might be a document, or if you're classifying poems by author, the entirety of T. S. Eliot's *The Waste Land* might be a document in your training set.
- **Corpus** >> Your entire collection of tokens and/or documents. In the previous example, all your poems would constitute your corpus, before you do a train/validation/test split.
- **Embedding** >> The representation of natural-language information in vector space.

#### How the heck do you even quantify language?

A naive -- but essential -- way to turn linguistic information into mathematical objects is the "bag of words" approach, which assumes that your tokens are fundamental units of meaning:

In [2]:
# Start with some text that is super easy for humans to parse...but maybe not to understand ㋡
koan = "A monk said to Seigen, “What is the essence of Buddhism?” Seigen said, “What is the price of rice in Roryo?”"

# Remove punctuation. This does throw away some linguistic information, but it's easier to not have to deal with it now.
punctuation_chars = [',', '“', '?', '”']  # A plain list of characters to strip
for char in punctuation_chars:
    koan = koan.replace(char, '')
print(koan)

A monk said to Seigen What is the essence of Buddhism Seigen said What is the price of rice in Roryo


In [3]:
# Split up the koan so the computer can recognize the words as distinct objects
split_koan = koan.split(' ')
print(split_koan)
print(f"Number of words: {len(split_koan)}")

['A', 'monk', 'said', 'to', 'Seigen', 'What', 'is', 'the', 'essence', 'of', 'Buddhism', 'Seigen', 'said', 'What', 'is', 'the', 'price', 'of', 'rice', 'in', 'Roryo']
Number of words: 21


In [4]:
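from sklearn.feature_extraction.text import CountVectorizer
# Each word is passed in as its own one-word "document", so every row of X is a one-hot vector.
# CountVectorizer lowercases its input and, by default, ignores single-character tokens,
# which is why "A" is missing from the vocabulary and row 0 is all zeros.
count_vect = CountVectorizer()
X = count_vect.fit_transform(split_koan)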
print("Vocabulary: ", count_vect.get_feature_names_out(),"\n")
print("Vocab size: ", count_vect.get_feature_names_out().shape, "\n")
print(X.toarray())
X.shape

Vocabulary:  ['buddhism' 'essence' 'in' 'is' 'monk' 'of' 'price' 'rice' 'roryo' 'said'
 'seigen' 'the' 'to' 'what'] 

Vocab size:  (14,) 

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]]


(21, 14)

This is now a (very sparse) list of features you can train a model on, for example, to predict what word comes next! But we'll get to that later.
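The matrix above has one row per word because each word was fed in as its own document. In a more typical bag-of-words setup, each row describes a whole document by counting how often every vocabulary word appears in it. Here's a minimal sketch in plain Python; the two toy documents and the helper name `bag_of_words` are made up for illustration, and this is not what `CountVectorizer` does internally.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of token lists."""
    # A sorted vocabulary gives every token a fixed column index
    vocabulary = sorted({token for doc in documents for token in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)  # Missing tokens count as 0
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

# Two toy documents (hypothetical, already lowercased and split)
docs = [
    "what is the essence of buddhism".split(),
    "what is the price of rice in roryo".split(),
]
vocab, bow = bag_of_words(docs)
print(vocab)
for row in bow:
    print(row)
```

Word order is thrown away entirely, which is where the "bag" in the name comes from.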

In [5]:
word_of_interest = "Seigen"
where_does_it_say = [idx for idx, word in enumerate(split_koan) if word == word_of_interest]
print(f"It says '{word_of_interest}' at position(s): {where_does_it_say}")
print("Here are the corresponding rows in the token matrix:")
for instance in where_does_it_say: # Haha, "for instance"
    print(X.toarray()[instance])

It says 'Seigen' at position(s): [4, 11]
Here are the corresponding rows in the token matrix:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0]


In [6]:
# You can get the frequency of each token by taking the sum over the 0 axis
print(count_vect.get_feature_names_out())
print(X.toarray().sum(axis=0))

['buddhism' 'essence' 'in' 'is' 'monk' 'of' 'price' 'rice' 'roryo' 'said'
 'seigen' 'the' 'to' 'what']
[1 1 1 2 1 2 1 1 1 2 2 2 1 2]


## Let's complicate things a bit.

Bag of words is super easy to do, but the representation it builds of a document is very high-dimensional and very sparse. This isn't good for memory or computation time, so we'd have to get creative. Good thing some people are paid to be creative so we don't have to be.

#### Using Word2Vec
Word2Vec is an algorithm that learns dense vector representations of tokens (that is, embeddings), in contrast to the sparse one-hot rows we built above. It does this by training a shallow neural network to either predict a word given its context or, given a word, to predict its context. I won't go into more detail than that, but the original paper is here, and the whole thing is very searchable since it was so widely used: https://arxiv.org/pdf/1301.3781.pdf

(Side note: One of the corresponding authors' emails is "jeff@google.com", an account name which I covet.)

In [7]:
import nltk
nltk.download("state_union")

[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Raman\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

In [8]:
from nltk.corpus import state_union
sotu = list(state_union.sents())
sotu[:2]

[['PRESIDENT',
  'HARRY',
  'S',
  '.',
  'TRUMAN',
  "'",
  'S',
  'ADDRESS',
  'BEFORE',
  'A',
  'JOINT',
  'SESSION',
  'OF',
  'THE',
  'CONGRESS'],
 ['April', '16', ',', '1945']]

We can already see a problem approaching. The dataset has been split up by word, but we still have to remove a ton of punctuation, along with near-useless tokens like the possessive "S". Also, words like "A" don't contribute much to meaning, but they do take up space in our high-dimensional embeddings, when that space could be used to better represent meaning from other tokens. So we also have to remove "A" and other similar "stopwords".

In [9]:
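nltk.download('stopwords')
from nltk.corpus import stopwords
# Note: calling words() with no argument pulls in the stopword lists for every language NLTK ships.
# Use stopwords.words('english') to restrict the set to English.
stop_words = set(stopwords.words())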

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Raman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
import tqdm
punctuation = {',', '.', "'", '"', ':', ';', '!', '?', '-', '--'}

cleaned_corpus = []
for sentence in tqdm.tqdm(sotu, desc="Cleaning token list"):
    cleaned_sentence = []
    for token in sentence:
        token = token.lower() # Because Harry Truman didn't yell much
        if (token not in stop_words) and (token not in punctuation):
            cleaned_sentence.append(token)
    cleaned_corpus.append(cleaned_sentence)
print(cleaned_corpus[:2])

Cleaning token list: 100%|██████████| 17930/17930 [00:00<00:00, 125363.19it/s]

[['president', 'harry', 'truman', 'address', 'joint', 'session', 'congress'], ['april', '16', '1945']]





Let's take a look at the most frequent tokens, now that the stopwords have been removed.

In [11]:
cleaned_corpus_flat = [token for sentence in cleaned_corpus for token in sentence]

freq_dist = nltk.FreqDist(cleaned_corpus_flat)
print("Most frequent tokens:")
print([token for token, count in freq_dist.most_common(10)])

Most frequent tokens:
['world', 'year', 'america', 'congress', 'government', 'years', 'american', 'nation', 'time', 'peace']


Looks good! But there's still a potential problem. We removed stopwords because their embeddings will be relatively meaningless. "America", on the other hand, is not meaningless (insert patriotic eagle call here). But "American" is so semantically similar to "America" that it might not be useful to include it as well. We can replace the words in our corpus with their more fundamental morphemes in an attempt to preserve the quality of our embeddings:

e.g. American → America, government → govern, etc.

This process is called stemming, and there is no perfect way to do it, as far as I know. A naive method will do things like "nation → n" because "ation" is a common suffix. We won't stem the whole corpus here because it would slow things down on a dataset this size, but here's a sample below.
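To see why this is hard, here is a deliberately naive suffix-stripper. The suffix list and the function name `naive_stem` are invented for this illustration. This is not how NLTK's Snowball stemmer works; it only shows how blindly chopping off suffixes mangles short words.

```python
# A made-up suffix list, checked in this order
SUFFIXES = ["ation", "ment", "an", "s"]

def naive_stem(word):
    """Strip the first matching suffix, with no sanity checks at all."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for word in ["american", "government", "nation", "years"]:
    print(word, "->", naive_stem(word))
```

Real stemmers add rules about how much of the word has to be left over before a suffix may be removed, which is what keeps a word like "nation" from collapsing to a single letter.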

In [12]:
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()

stemmed_tokens = ['']*20
for idx in tqdm.trange(len(stemmed_tokens), desc="Stemming words"):
    stemmed_tokens[idx] = stemmer.stem(cleaned_corpus_flat[idx])
print(stemmed_tokens)

Stemming words: 100%|██████████| 20/20 [00:00<00:00, 20001.45it/s]

['presid', 'harri', 'truman', 'address', 'joint', 'session', 'congress', 'april', '16', '1945', 'mr', 'speaker', 'mr', 'presid', 'member', 'congress', 'heavi', 'heart', 'stand', 'friend']





In [13]:
# Now we train the Word2Vec model!
from gensim.models import Word2Vec
print("Creating embeddings")
model = Word2Vec(sentences=cleaned_corpus, vector_size=100, min_count=1, seed=0, workers=1)  # Set workers=1 to keep training deterministic
print(f"Words in the vocabulary: {len(model.wv)}")

Creating embeddings
Words in the vocabulary: 12311


Now for the obligatory NLP quote:
## "You shall know a word by the company it keeps."
#### --John Rupert Firth

As mentioned before, Word2Vec is great at generating contextual associations between words. It accomplishes this by passing a sliding window over its training samples to see which words frequently occur near other words. This forms a pretty darn good approximation of semantic similarity.
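To make the sliding window concrete, here's a small sketch that just counts which words show up within a couple of tokens of each other. The toy sentences and the helper name `cooccurrence_counts` are made up, and this is only an illustration of the window idea: Word2Vec trains a network on these neighborhoods rather than tallying counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each pair of distinct words appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        for i, center in enumerate(sentence):
            # Look only to the right so each pair is counted once per occurrence
            for neighbor in sentence[i + 1 : i + 1 + window]:
                if neighbor != center:
                    # Sort the pair so (a, b) and (b, a) share one key
                    counts[tuple(sorted((center, neighbor)))] += 1
    return counts

# Hypothetical cleaned sentences, in the same shape as cleaned_corpus
toy = [
    ["mr", "speaker", "mr", "president", "members", "congress"],
    ["mr", "president", "distinguished", "members"],
]
counts = cooccurrence_counts(toy, window=2)
for pair, count in counts.most_common(3):
    print(pair, count)
```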

In [14]:
# We can see which words are associated with a word in the model's vocabulary
print(model.wv.most_similar('mr', topn=4))

[('speaker', 0.9835394620895386), ('president', 0.9781222939491272), ('members', 0.9631961584091187), ('vice', 0.9588627219200134)]
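The scores that `most_similar` returns are cosine similarities between embedding vectors: 1 means two vectors point in the same direction, and 0 means they're orthogonal. A minimal sketch with made-up 3-dimensional vectors (the real embeddings here have 100 dimensions, and the names below are just labels for the toy data):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"
mr = [1.0, 2.0, 0.0]
speaker = [2.0, 4.0, 0.0]  # Same direction as mr, just longer
rice = [0.0, 0.0, 3.0]     # Orthogonal to mr

print(cosine_similarity(mr, speaker))
print(cosine_similarity(mr, rice))
```

Because only the direction matters, a vector's length has no effect on its similarity to other vectors.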


We can also retrain the model using the "skip-gram" architecture (`sg=1`) instead of the default "continuous bag of words" (CBOW), and change the size of the window. Skip-gram trains the network to predict the words surrounding a given word, whereas CBOW predicts a word from its surrounding context.
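The two setups differ in which side of the window is the input and which is the target. Here's a rough sketch of the training examples each one would draw from a single sentence. The helper name `training_pairs` and the toy sentence are made up, and the real implementation adds refinements such as negative sampling that are left out here.

```python
def training_pairs(sentence, window=2):
    """Yield (center, context_words) for every position in a sentence."""
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window) : i]
        right = sentence[i + 1 : i + 1 + window]
        yield center, left + right

sentence = ["what", "is", "the", "price", "of", "rice"]

# Skip-gram: one (input=center word, target=context word) example per context word
skipgram_examples = [
    (center, ctx) for center, context in training_pairs(sentence) for ctx in context
]

# CBOW: one (input=all context words, target=center word) example per position
cbow_examples = [(context, center) for center, context in training_pairs(sentence)]

print(skipgram_examples[:4])
print(cbow_examples[:2])
```

Skip-gram produces many more (and smaller) training examples from the same text, which is one reason it is usually slower to train than CBOW.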

In [15]:
print("Creating embeddings")
sgmodel = Word2Vec(sentences=cleaned_corpus, vector_size=100, min_count=1, seed=0, workers=1, sg=1, window=2)
print(f"Words in the vocabulary: {len(sgmodel.wv)}")

Creating embeddings
Words in the vocabulary: 12311


In [16]:
print(sgmodel.wv.most_similar('mr', topn=4))

[('vice', 0.9910067319869995), ('speaker', 0.9890087842941284), ('president', 0.9488757848739624), ('distinguished', 0.9383575916290283)]


## Autobots, Roll Out!

Using neural networks brings us to the state of the art: transformer models. (For now, anyway. But if not: Hi, people of the future! I hope the GPT-23 is treating you well.) Transformers are great! They process every token in a sequence in parallel rather than one step at a time, which makes training easy to distribute across many cores or devices. And rather than deciding what to "forget" as they step through a sequence, like recurrent LSTM models do, they pay "attention" to which pieces of information are associated with which others.

![attention](attention.png)

This type of attention, where tokens in a sequence are associated with other tokens in the same sequence, is called self-attention. Also computed by a transformer's attention module is "cross-attention", which -- guess what -- associates tokens between two different sequences. 

![cross](cross.png)

(Image from www.tensorflow.org/text/tutorials/transformer)
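Under the hood, both kinds of attention boil down to the same recipe: score each query against every key, turn the scores into weights with a softmax, and take a weighted average of the values. Here's a bare-bones sketch in plain Python with made-up 2-dimensional vectors. It assumes a single head and skips the learned projections that a real layer like `MultiHeadAttention` applies to the queries, keys, and values.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    top = max(scores)  # Subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up token vectors
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: queries, keys, and values all come from the same sequence
self_out, self_weights = attention(tokens, tokens, tokens)

# Cross-attention: queries come from one sequence, keys and values from another
other = [[2.0, 0.0], [0.0, 2.0]]
cross_out, cross_weights = attention(tokens, other, other)

print(self_weights[0])
print(cross_weights[2])
```

Each row of weights says how much one token "looks at" every token in the sequence it attends over, which is exactly what the pictures above are visualizing.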

Following is some code adapted from https://keras.io/examples/nlp/neural_machine_translation_with_transformer/ .

It's not commented, because there's a lot of detail I'd have to go into in order to explain it, and I'm a busy man. Sorry.

Or should I say...mi dispiace.

In [17]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import matplotlib.pyplot as plt
%matplotlib inline

In [18]:
with open("ita-eng/ita.txt", encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, ita, junk = line.split("\t")
    ita = "[start] " + ita + " [end]"
    text_pairs.append((eng, ita))
    
for _ in range(5):
    print(random.choice(text_pairs))

('Tom tossed Mary the ball.', '[start] Tom lanciò la palla a Mary. [end]')
("I didn't have time to watch TV yesterday.", '[start] Non avevo il tempo di guardare la TV ieri. [end]')
('I went there recently.', '[start] Io sono andata lì recentemente. [end]')
('What do you think of the new teacher?', '[start] Che ne pensate del nuovo professore? [end]')
('I tried it.', "[start] Io l'ho provato. [end]")


In [19]:
random.shuffle(text_pairs)
num_valid_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_valid_samples
train_pairs = text_pairs[:num_train_samples]
valid_pairs = text_pairs[num_train_samples : num_train_samples + num_valid_samples]
test_pairs = text_pairs[num_train_samples + num_valid_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(valid_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

364200 total pairs
254940 training pairs
54630 validation pairs
54630 test pairs


In [20]:
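# Reduce dataset size, or else this will take forever to train.
# Slice each split separately so the validation and test pairs don't overlap the training pairs.
train_pairs = train_pairs[:1000]
valid_pairs = valid_pairs[:150]
test_pairs = test_pairs[:150]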

In [21]:
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

vocab_size = 15000
sequence_length = 20
batch_size = 64


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")


eng_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length)
ita_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization)
train_eng_texts = [pair[0] for pair in train_pairs]
train_ita_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
ita_vectorization.adapt(train_ita_texts)

In [22]:
def format_dataset(eng, ita):
    eng = eng_vectorization(eng)
    ita = ita_vectorization(ita)
    return ({"encoder_inputs": eng, "decoder_inputs": ita[:, :-1],}, ita[:, 1:])


def make_dataset(pairs):
    eng_texts, ita_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    ita_texts = list(ita_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, ita_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    # Cache before shuffling so the batches are reshuffled every epoch
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_dataset(train_pairs)
valid_ds = make_dataset(valid_pairs)
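The slicing in `format_dataset` is worth a second look: the decoder is fed the target sentence minus its last token (`ita[:, :-1]`) and is asked to predict the same sentence minus its first token (`ita[:, 1:]`), so at every position the label is simply the next token. Here's a plain-Python sketch of that shift, using a made-up list of word strings in place of the padded integer ids the real pipeline produces:

```python
# A made-up tokenized target sentence (the real pipeline uses padded integer ids)
target = ["[start]", "io", "sono", "andata", "[end]"]

decoder_input = target[:-1]  # What the decoder sees
decoder_label = target[1:]   # What it should predict at each position

for seen, expected in zip(decoder_input, decoder_label):
    print(f"{seen:>8} -> {expected}")
```

This is also why `ita_vectorization` outputs sequences one token longer than `sequence_length`: after the shift, both halves end up exactly `sequence_length` long.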

In [23]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        padding_mask = None  # Fall back to no masking when no mask is passed in
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(latent_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        padding_mask = None  # Fall back to no masking when no mask is passed in
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)
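One piece of the decoder that's easy to picture is the causal mask. The `i >= j` comparison in `get_causal_attention_mask` builds a lower-triangular matrix, so position `i` may only attend to positions at or before it and can't peek at the tokens it's supposed to predict. Here's the same comparison in plain Python for a sequence of length 4 (the helper name `causal_mask` is made up, and the batch dimension is left out):

```python
def causal_mask(sequence_length):
    """Row i has a 1 in column j when position i may attend to position j."""
    return [
        [1 if i >= j else 0 for j in range(sequence_length)]
        for i in range(sequence_length)
    ]

for row in causal_mask(4):
    print(row)
```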

In [24]:
embed_dim = 256
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

In [25]:
epochs = 20

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=epochs, validation_data=valid_ds, verbose=True)

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 positional_embedding (Position  (None, None, 256)   3845120     ['encoder_inputs[0][0]']         
 alEmbedding)                                                                                     
                                                                                                  
 decoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 transformer_encoder (Transform  (None, None, 256)   3155456     ['positional_embedding[

<keras.callbacks.History at 0x155ffab4a00>

In [26]:
ita_vocab = ita_vectorization.get_vocabulary()
ita_index_lookup = dict(zip(range(len(ita_vocab)), ita_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = ita_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = ita_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        if sampled_token == "[end]":
            break
    return decoded_sentence


test_eng_texts = [pair[0] for pair in test_pairs]
translated_texts = []
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequence(input_sentence)
    translated_texts.append(translated)
    print(input_sentence, translated, "\n")

Did you know that Tom was using cocaine? [start] ha detto che era troppo [end] 

This horse hasn't been ridden in weeks. [start] questo libro non è stato in biblioteca [end] 

Did you also invite your friends? [start] ha detto di non la sua sorella [end] 

Tom made me something to eat. [start] tom mi ha intenzione di non riesco a parlare [end] 

Your contribution to the school is tax-deductible. [start] la vostra di più duramente [end] 

This horse hasn't been ridden in weeks. [start] questo libro non è stato in biblioteca [end] 

Why don't you have any children yet? [start] perché non lo vuoi un errore [end] 

What does Tom have? [start] che i miei soldi [end] 

See if the gas is turned off. [start] era un dottore vero [end] 

Tom's parents were teachers. [start] i genitori di noi erano bagnati di noi [end] 

I believe there is a problem. [start] io mi sento un dottore vero [end] 

Sydney has a beautiful natural harbor. [start] ha un dottore vero [end] 

I slept soundly. [start] io so

In [28]:
your_sentence = "Am I a better translator than Google?" # Probably not.
tua_frase = decode_sequence(your_sentence)
print(tua_frase)

[start] le sono un po di sua di cibo [end]
