# Neural Language Model - Basic (Word Prediction Example)

In this example, I'll show a simple language model.<br>
In general, a language model is used for a variety of NLP tasks, such as translation, transcription, and summarization. In this example, however, we just train and use the model for text generation (next-word prediction).<br>
Unlike the [previous example](./02_custom_embedding.ipynb), this model recognizes the order of the words in the sequence.

Specialized sequence architectures (such as RNN-based LSTM, or Transformer) are widely used to train language models in today's algorithms. In this example, however, I'll apply only a plain dense (fully connected feed-forward) network to keep things simple for beginners. (In the later examples, we'll discuss more practical models with RNN.)<br>
See the following diagram for the entire network.

![Model in this exercise](images/language_model_beginning.png?raw=true)

Note that this model ignores any context beyond the most recent words.<br>
For example, even when the following sentence is given, 

"In the United States, the president has now been"

it ignores the context "In the United States", because the network refers only to the last 5 words. (It might then predict a word that is incorrect in this context, so the accuracy in this example won't be very high. See the [later example](./06_language_model_rnn.ipynb) for an RNN-based architecture, which addresses this problem.)
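
To make the fixed window concrete, here is a minimal sketch of which words the network would actually see for the sentence above. It assumes a simplified whitespace tokenization for illustration only, not the tokenizer used later in this notebook.

```python
# Simplified tokenization for illustration only (not the Keras tokenizer used later).
sentence = "In the United States, the president has now been"
words = sentence.lower().replace(",", "").split()

window_size = 5                  # the network sees only this many words
visible = words[-window_size:]   # words fed to the network
ignored = words[:-window_size]   # earlier context that is dropped

print("visible:", visible)
print("ignored:", ignored)
```

Everything in `ignored` never reaches the network, so it cannot influence the predicted next word.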

Nevertheless, neural language models generalize to unseen data better than traditional statistical models. For instance, if "red shirt" and "blue shirt" occur in the training set, "green shirt" (which is not seen in the training set) can also be predicted by the trained neural model, because the model learns that "red", "blue", and "green" occur in similar contexts.

The language model in this example can be treated as an unsupervised approach, because the labels (the next words) come from the text itself.<br>
As you saw in the [previous example](./02_custom_embedding.ipynb), the word embedding is also a byproduct of the trained language model.

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install tensorflow==2.6.2 pandas numpy

## Prepare data

In this example, I use the short description text of news articles, since it consists of formal, concise sentences. (It's modern English without slang.)<br>
Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset) (collected from HuffPost) from Kaggle.

In [1]:
import pandas as pd

df = pd.read_json("News_Category_Dataset_v2.json",lines=True)
train_data = df["short_description"]
train_data

0         She left her husband. He killed their children...
1                                  Of course it has a song.
2         The actor and his longtime girlfriend Anna Ebe...
3         The actor gives Dems an ass-kicking for not fi...
4         The "Dietland" actress said using the bags is ...
                                ...                        
200848    Verizon Wireless and AT&T are already promotin...
200849    Afterward, Azarenka, more effusive with the pr...
200850    Leading up to Super Bowl XLVI, the most talked...
200851    CORRECTION: An earlier version of this story i...
200852    The five-time all-star center tore into his te...
Name: short_description, Length: 200853, dtype: object

To get better performance (accuracy), we standardize the input text as follows.
- Convert all words to lowercase in order to reduce the vocabulary
- Replace "-" (hyphen) with a space
- Remove all punctuation except "'" (e.g., Ken's bag) and "&" (e.g., AT&T)

> Note : N-gram words (such as "New York" or "ice cream") should also be handled, but I have skipped this pre-processing here for simplicity.<br>
> In strict pre-processing, we should also take care of polysemy. (Different meanings of the same word should have different tokens.)

In [2]:
train_data = train_data.str.lower()
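# pass regex explicitly, since the default of str.replace() differs between pandas versions
train_data = train_data.str.replace("-", " ", regex=False)
train_data = train_data.str.replace(r"[^'&\w\s]", "", regex=True)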
train_data = train_data.str.strip()
train_data

0         she left her husband he killed their children ...
1                                   of course it has a song
2         the actor and his longtime girlfriend anna ebe...
3         the actor gives dems an ass kicking for not fi...
4         the dietland actress said using the bags is a ...
                                ...                        
200848    verizon wireless and at&t are already promotin...
200849    afterward azarenka more effusive with the pres...
200850    leading up to super bowl xlvi the most talked ...
200851    correction an earlier version of this story in...
200852    the five time all star center tore into his te...
Name: short_description, Length: 200853, dtype: object
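
To check the three standardization rules on a single string, the following sketch applies them with Python's built-in `re` module. The helper `standardize()` and the sample sentence are made up for illustration and are not part of the original notebook.

```python
import re

def standardize(text):
    """Apply the same three rules to a single string (illustrative helper)."""
    text = text.lower()                      # lowercase
    text = text.replace("-", " ")            # hyphen to space
    text = re.sub(r"[^'&\w\s]", "", text)    # drop punctuation except ' and &
    return text.strip()

sample = "The five-time all-star's bag, sold by AT&T!"
result = standardize(sample)
print(result)
```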

## Generate sequence inputs

As in the [previous example](02_custom_embedding.ipynb), we generate sequences of word indices (i.e., tokenize) from the text.

![Index vectorize](images/index_vectorize.png?raw=true)

In [3]:
import tensorflow as tf

max_word = 70000

corpus = " ".join(train_data)
new_tokens = [w for w in corpus.split() if w.isalpha()]
new_corpus = " ".join(new_tokens)
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=max_word,
    oov_token="[UNK]"
)
tokenizer.fit_on_texts([new_corpus])
#vocab_size = len(tokenizer.word_index)
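
As a rough mental model of what the tokenizer does here, the following simplified pure-Python sketch assigns index 1 to the out-of-vocabulary token and the following indices to words in descending order of frequency. The helpers `build_word_index()` and `to_sequence()` are hypothetical names for illustration. The real Keras `Tokenizer` also handles character filtering and the `num_words` limit, which this sketch omits.

```python
from collections import Counter

def build_word_index(text, oov_token="[UNK]"):
    """Assign index 1 to the OOV token, then 2, 3, ... by descending frequency."""
    counts = Counter(text.split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequence(text, word_index, oov_token="[UNK]"):
    """Map each word to its index, falling back to the OOV index."""
    return [word_index.get(w, word_index[oov_token]) for w in text.split()]

toy_index = build_word_index("the cat sat on the mat the end")
toy_seq = to_sequence("the dog sat", toy_index)
print(toy_index)
print(toy_seq)
```

Index 0 is never assigned to a word, which is why it remains free to serve as a special marker.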

In our example, each sentence is split into sequences of 5 preceding words plus a label word (6 words in total per sequence) as follows.

![Separate words](images/separate_sequence_for_next_words.png?raw=true)

In [4]:
import numpy as np

seq_len = 5 + 1
input_seq = []
for s in train_data:
    token_list = tokenizer.texts_to_sequences([s])[0]
    # append index 0 as the end-of-sequence marker (the tokenizer never assigns 0 to a word)
    token_list.append(0)
    for i in range(seq_len, len(token_list) + 1):
        seq_list = token_list[i-seq_len:i]
        input_seq.append(seq_list)
print("The number of training input sequence :{}".format(len(input_seq)))
input_seq = np.array(input_seq)

The number of training input sequence :3266478
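
The sliding window can be traced on a toy sentence. This sketch reuses the same slicing logic with made-up word indices. `make_windows()` is a hypothetical helper name, not something defined in the notebook.

```python
seq_len = 5 + 1  # 5 input words + 1 label

def make_windows(token_list, seq_len):
    """Return every contiguous window of length seq_len, after appending 0."""
    tokens = token_list + [0]   # 0 marks the end of the sentence
    return [tokens[i - seq_len:i] for i in range(seq_len, len(tokens) + 1)]

toy_tokens = [11, 12, 13, 14, 15, 16, 17]   # hypothetical word indices
windows = make_windows(toy_tokens, seq_len)
for w in windows:
    print("input:", w[:-1], "label:", w[-1])

# A sentence with fewer than 5 words yields no window at all.
short_windows = make_windows([11, 12, 13], seq_len)
print(short_windows)
```

The last window has label 0, which is how the model learns to predict the end of a sentence.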


In [5]:
X, y = input_seq[:,:-1], input_seq[:,-1]
train_tf_data = tf.data.Dataset.from_tensor_slices((X, y))
def to_one_hot(x, y):
    # convert the label index into a one-hot vector over the vocabulary
    return x, tf.one_hot(y, depth=max_word)
train_tf_data = train_tf_data.map(to_one_hot)
#tf.data.experimental.save(train_tf_data, "saved_data")

## Build network and Train

In [None]:
#train_tf_data = tf.data.experimental.load("saved_data")

In [6]:
embedding_dim = 64

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(
    max_word,
    embedding_dim,
    input_length=seq_len - 1,
    trainable=True,
    name="embedding"))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(
    256,
    activation="relu",
    trainable=True))
model.add(tf.keras.layers.Dense(
    78,
    activation="relu",
    trainable=True))
model.add(tf.keras.layers.Dense(
    max_word,
    activation=None,
    trainable=True))

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
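
The last layer above has no activation, so it outputs raw scores (logits), and `from_logits=True` tells the loss to apply softmax internally. The following NumPy sketch shows roughly what that computation looks like for one-hot labels. It is an illustration with made-up numbers and hypothetical helper names, not the Keras implementation itself.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_crossentropy_from_logits(one_hot, logits):
    """Cross-entropy between one-hot labels and softmax(logits), averaged over the batch."""
    probs = softmax(logits)
    return float(np.mean(-np.sum(one_hot * np.log(probs), axis=-1)))

# Hypothetical batch: 2 samples, vocabulary of 4 words.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.0, 3.0, 0.0]])
labels = np.array([0, 2])
one_hot = np.eye(4)[labels]

loss = categorical_crossentropy_from_logits(one_hot, logits)
print(loss)
```

Because the label vector is one-hot, only the log-probability of the correct word contributes to each sample's loss.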

In [7]:
model.fit(
    train_tf_data.shuffle(10000).batch(512),
    epochs=60)

Epoch 1/60
...
Epoch 60/60


<keras.callbacks.History at 0x7f17050820f0>

In [None]:
# model.save("trained_model/exercise04")

## Generate text

In this example, I'll show how the model generates a sentence by repeatedly predicting the probability of each vocabulary word given the most recent 5 words, until it predicts the end-of-sequence.<br>
As mentioned above, this model doesn't recognize earlier context, because it refers only to the last 5 words.

> Note : This approach (which greedily picks the most probable next word at each step to build up a sentence) may sometimes lead to sub-optimal solutions (i.e., the label-bias problem). I won't go further here, but in practice you can take other approaches, such as beam search, when you need a sentence with globally high probability.
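
The sub-optimality mentioned in the note can be seen in a tiny toy case. The probabilities below are invented purely for illustration and are unrelated to the trained model. Picking the best word at each step ends up with a less probable sentence than searching over all two-word sentences.

```python
# Hypothetical next-word probabilities for a two-step toy "language".
first = {"a": 0.6, "b": 0.4}
second = {
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
}

def greedy():
    """Pick the most probable word at each step."""
    w1 = max(first, key=first.get)
    w2 = max(second[w1], key=second[w1].get)
    return (w1, w2), first[w1] * second[w1][w2]

def exhaustive():
    """Score every two-word sentence and return the most probable one."""
    candidates = [
        ((w1, w2), first[w1] * p2)
        for w1 in first
        for w2, p2 in second[w1].items()
    ]
    return max(candidates, key=lambda c: c[1])

greedy_sentence, greedy_prob = greedy()
best_sentence, best_prob = exhaustive()
print(greedy_sentence, greedy_prob)
print(best_sentence, best_prob)
```

Exhaustive search is only feasible for toy cases like this one. With a real vocabulary, a bounded search over a few candidates per step is the usual compromise.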

In [None]:
# model = tf.keras.models.load_model("trained_model/exercise04")

In [23]:
def pred_output(sentence):
    test_seq = tokenizer.texts_to_sequences([sentence])[0]
    while True:
        # predict the next word from the last 5 word indices
        pred_val = model.predict(np.array([test_seq[-5:]]))
        # np.asscalar() is deprecated (and removed in NumPy 1.23), so use int() instead
        pred_class = int(np.argmax(pred_val, axis=1)[0])
        # index 0 means end-of-sequence
        if pred_class == 0:
            break
        test_seq.append(pred_class)
        # print the sentence generated so far
        print(" ".join(tokenizer.index_word[i] for i in test_seq), end=" ")
        print("\n")

pred_output("In the United States president")
pred_output("The president has accused by")
pred_output("Now he was expected to")



in the united states president barack 

in the united states president barack obama 

the president has accused by a 

the president has accused by a provocation 

the president has accused by a provocation by 

the president has accused by a provocation by sexual 

the president has accused by a provocation by sexual misconduct 

now he was expected to be 

now he was expected to be a 

now he was expected to be a [UNK] 

now he was expected to be a [UNK] year 

now he was expected to be a [UNK] year old 



## To the next exercise

As you saw above, the model in this example is computationally expensive, because it computes probabilities over all target words, and it then consumes a lot of computing resources (memory and disk space) depending on the size of the output vocabulary. (With 70,000 words and 3,000,000 records in the training set, it needs 70,000 * 3,000,000 float values.)<br>
To make it scale to very large vocabularies, the algorithm can be modified with a method called Negative Sampling (NS). In the next exercise, I'll show you Negative Sampling (NS) in the well-known Word2Vec algorithm.
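
To put the figure above in perspective, this short calculation estimates the size of the one-hot labels if they were fully materialized (for instance, when saved to disk). The 4 bytes per value is an assumption (32-bit floats). The counts are the rounded numbers quoted above.

```python
vocab_size = 70_000        # output vocabulary size
num_records = 3_000_000    # approximate number of training sequences
bytes_per_float = 4        # assumption: float32

num_values = vocab_size * num_records
total_bytes = num_values * bytes_per_float

print(f"{num_values:,} float values")
print(f"{total_bytes / 1e9:,.0f} GB as float32")
```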