<a href="https://colab.research.google.com/github/viktoruebelhart/Keras_NPL_News/blob/main/word_completer_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We built a Deep Learning model to classify articles into different categories. Now, a new need has emerged: our content portal wants editors and journalists to follow a consistent style when writing texts. To do this, we will create a word completer that follows a specific style, similar to the feature found in text editors.

In [1]:
import pandas as pd
import tensorflow as tf

url = 'https://github.com/allanspadini/curso-tensorflow-proxima-palavra/raw/main/dados/train.zip'

df = pd.read_csv(url, header=None, names=['ClassIndex', 'Título', 'Descrição'])

df['Texto'] = df['Título'] + '' + df['Descrição']

In [2]:
df['Texto']

Unnamed: 0,Texto
0,Wall St. Bears Claw Back Into the Black (Reute...
1,Carlyle Looks Toward Commercial Aerospace (Reu...
2,Oil and Economy Cloud Stocks' Outlook (Reuters...
3,Iraq Halts Oil Exports from Main Southern Pipe...
4,"Oil prices soar to all-time record, posing new..."
...,...
119995,Pakistan's Musharraf Says Won't Quit as Army C...
119996,Renteria signing a top-shelf dealRed Sox gener...
119997,Saban not going to Dolphins yetThe Miami Dolph...
119998,Today's NFL gamesPITTSBURGH at NY GIANTS Time:...


In [3]:
import random

random.seed(42)
df_sample = df.sample(n=1000)

We will create a corpus, which is our complete set of texts, by converting the contents of the text column to a list.

In [4]:
corpus = df_sample['Texto'].tolist()

In [5]:
from tensorflow.keras.layers import TextVectorization

Maximum vocabulary size as 20,000 words and maximum sequence size as 50 words.
These variables will be max_vocab_size and max_sequence_len, respectively.

In [6]:
max_vocab_size = 20000
max_sequence_len = 50

In [7]:
vectorizer = TextVectorization(max_tokens=max_vocab_size,
                             output_mode='int',
                             output_sequence_length=max_sequence_len)

In [8]:
vectorizer.adapt(corpus)

In [9]:
token_corpus = vectorizer(corpus)

In [10]:
import pickle

with open('vectorizer.pkl', 'wb') as file:
  pickle.dump(vectorizer, file)

With the vectorizer saved and the text tokenized, we need to organize the data into an ideal format to train the neural network, which will learn to suggest the next words. We will do this by passing different strings of text.

Let's create an empty list called input_sequences to store the text sequences. We will then populate this list based on our tokenized data. We will loop through each list of tokens in the tokenized_corpus (converted to a numeric format) and add these sequences to the input_sequences list.

In [11]:
input_sequences = []

for token_list in token_corpus.numpy():
  for i in range(1, len(token_list)):
    n_gram_sequence = token_list[:i+1]
    input_sequences.append(n_gram_sequence)

In [12]:
input_sequences

[array([7323,  143]),
 array([7323,  143,  543]),
 array([7323,  143,  543,  911]),
 array([7323,  143,  543,  911,  164]),
 array([7323,  143,  543,  911,  164, 6317]),
 array([7323,  143,  543,  911,  164, 6317,   55]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23,  473]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23,  473,
        5493]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23,  473,
        5493,  313]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23,  473,
        5493,  313, 8242]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23,  473,
        5493,  313, 8242,    5]),
 array([7323,  143,  543,  911,  164, 6317,   55,  143,   78,   23,  473,
        5493,  313, 8242,    5, 1173]),


In [13]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
def prepare_sequences(sequences):
    """
    Prepares the sequences for the model by removing trailing zeros, adding left padding, truncating long sequences, and removing duplicate sequences.

    Args:
        sequences: An array of sequences (lists or NumPy arrays).

    Returns:
        A 2D NumPy array with the prepared sequences.
    """

    # Remove trailing zeros from each sequence
    sequences_without_trailing_zeros = []
    for seq in sequences:
        last_nonzero_index = np.argmax(seq[::-1] != 0)
        if last_nonzero_index == 0 and seq[-1] == 0:
            sequences_without_trailing_zeros.append(np.array([0]))
        else:
            sequences_without_trailing_zeros.append(seq[:-last_nonzero_index or None])

    # Remove duplicate sequences
    unique_sequences = []
    for seq in sequences_without_trailing_zeros:
        if seq.tolist() not in unique_sequences:  # Check if the sequence is already in the list
            unique_sequences.append(seq.tolist())  # Add to the list if it is unique

    # Find the maximum length of the sequences without trailing zeros
    max_sequence_len = max(len(seq) for seq in unique_sequences)

    # Add left padding to ensure the same length
    padded_sequences = pad_sequences(unique_sequences, maxlen=max_sequence_len, padding='pre', truncating='post')

    return padded_sequences


In [17]:
input_sequences_prepared = prepare_sequences(input_sequences)
print(input_sequences)

[[   0    0    0 ...    0 7323  143]
 [   0    0    0 ... 7323  143  543]
 [   0    0    0 ...  143  543  911]
 ...
 [   0    0    0 ... 1480 1215    5]
 [   0    0    0 ... 1215    5   48]
 [   0    0    0 ...    5   48 6814]]


Splitting the Sequence

Let's start by splitting this sequence. The first columns of the matrix will be our features (x), and the last column will be our target (y). This will be a multiclass classification problem, where the number of classes will be equal to the number of words in our vocabulary.

In [18]:
x = input_sequences_prepared[:, :-1]
y = input_sequences_prepared[:, -1]

This separates our input_sequences_prepared array into x and y.

If we view x, we see all columns except the last one. If we view y, we see the last column with numbers representing the words.

In [19]:
x

array([[   0,    0,    0, ...,    0,    0, 7323],
       [   0,    0,    0, ...,    0, 7323,  143],
       [   0,    0,    0, ..., 7323,  143,  543],
       ...,
       [   0,    0,    0, ...,    2, 1480, 1215],
       [   0,    0,    0, ..., 1480, 1215,    5],
       [   0,    0,    0, ..., 1215,    5,   48]], dtype=int32)

In [20]:
y

array([ 143,  543,  911, ...,    5,   48, 6814], dtype=int32)