## Detailed article explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/540959/language-modeling-with-lstm-using-wikipedia-text-predicting-next-word

For my other articles for Daniweb.com, please see this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Import Wikipedia Data

In [23]:
! pip install wikipedia



In [24]:
import wikipedia
pages = wikipedia.search("Artificial Intelligence")
pages

['Artificial intelligence',
 'Generative artificial intelligence',
 'Artificial general intelligence',
 'A.I. Artificial Intelligence',
 'Applications of artificial intelligence',
 'Hallucination (artificial intelligence)',
 'Ethics of artificial intelligence',
 'History of artificial intelligence',
 'Swarm intelligence',
 'Friendly artificial intelligence']

In [25]:
pages = wikipedia.search("Artificial Intelligence", results = 1)
pages

['Artificial intelligence']

In [26]:
ai_page = wikipedia.page(pages[0])
print(ai_page.title)

Artificial intelligence


In [27]:
ai_page.content

'Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is also the field of study in computer science that develops and studies intelligent machines. "AI" may also refer to the machines themselves.\nAI technology is widely used throughout industry, government and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), generative or creative tools (ChatGPT and AI art), and competing at the highest level in strategic games (such as chess and Go).Artificial intelligence was founded as an academic discipline in 1956. The field went through multiple cycles of optimism followed by disappointment and loss of funding, but after 2012, when deep learning surpassed all previous AI techniques, there was a vast increase in funding and int

## Preprocessing Text for LSTM

In [28]:
import re

# Split the text into sentences using '\n' as the separator
sentences = ai_page.content.split('\n')

# Function to remove special characters from a sentence
def remove_special_characters(sentence):
    return re.sub(r'[^\w\s]', '', sentence)

# Filter sentences with at least two words
sentences = [sentence.strip() for sentence in sentences if len(remove_special_characters(sentence).split()) >= 2]

# Printing length of total sentences
print(f"Total number of sentences: {len(sentences)}")

sentences[0]


Total number of sentences: 224


'Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is also the field of study in computer science that develops and studies intelligent machines. "AI" may also refer to the machines themselves.'
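To see what the two-word filter keeps and drops, here is a small sketch on a made-up string. The sample text and the `split_into_sentences` helper are assumptions for illustration only; they are not part of the notebook.

```python
import re

def remove_special_characters(sentence):
    # Drop everything except word characters and whitespace
    return re.sub(r'[^\w\s]', '', sentence)

def split_into_sentences(text):
    # Hypothetical helper: same split-and-filter logic as the cell above
    lines = text.split('\n')
    return [line.strip() for line in lines
            if len(remove_special_characters(line).split()) >= 2]

# Made-up sample in the style of a plain-text wiki page (an assumption)
sample = ("AI is a field of study.\n\n== History ==\n"
          "The field was founded in 1956.\n== See also ==")
kept = split_into_sentences(sample)
print(kept)
# ['AI is a field of study.', 'The field was founded in 1956.', '== See also ==']
```

Empty lines and one-word headings are dropped, but a multi-word heading such as `== See also ==` still passes the filter. Note also that splitting on `'\n'` yields paragraphs rather than individual sentences, which is why `sentences[0]` above contains several sentences.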

In [29]:
from keras.preprocessing.text import Tokenizer

# Create a tokenizer instance
tokenizer = Tokenizer()

# Fit the tokenizer on the text data
tokenizer.fit_on_texts(sentences)

# Print the word index (word to integer mapping)
tokenizer.word_index

{'the': 1,
 'and': 2,
 'of': 3,
 'to': 4,
 'a': 5,
 'in': 6,
 'that': 7,
 'ai': 8,
 'is': 9,
 'as': 10,
 'for': 11,
 'are': 12,
 'by': 13,
 'or': 14,
 'it': 15,
 'intelligence': 16,
 'be': 17,
 'learning': 18,
 'artificial': 19,
 'can': 20,
 'with': 21,
 'an': 22,
 'machine': 23,
 'on': 24,
 'not': 25,
 'they': 26,
 'such': 27,
 'this': 28,
 'have': 29,
 'used': 30,
 'was': 31,
 'from': 32,
 'human': 33,
 'problems': 34,
 'has': 35,
 'knowledge': 36,
 'these': 37,
 'other': 38,
 'use': 39,
 'most': 40,
 'research': 41,
 'networks': 42,
 'many': 43,
 'may': 44,
 'search': 45,
 'problem': 46,
 'at': 47,
 'neural': 48,
 'been': 49,
 'will': 50,
 'if': 51,
 'were': 52,
 'data': 53,
 'reasoning': 54,
 'about': 55,
 'what': 56,
 'also': 57,
 'deep': 58,
 'researchers': 59,
 'decision': 60,
 'general': 61,
 'including': 62,
 'machines': 63,
 'technology': 64,
 'solve': 65,
 'make': 66,
 'their': 67,
 '–': 68,
 'had': 69,
 'applications': 70,
 'when': 71,
 'there': 72,
 'program': 73,
 'possib

In [30]:
vocab_size = len(tokenizer.word_index)
print(vocab_size)

2472
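The word index above is ranked by frequency and starts at 1, so index 0 is never assigned to a word. A rough pure-Python sketch of that idea follows. `build_word_index` is a hypothetical helper, not the Keras implementation, and it skips the punctuation filtering that the real `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts):
    # Hypothetical helper: count lowercased, whitespace-split words
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # The most frequent word gets index 1; index 0 stays unused
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["the cat sat", "the dog sat", "the end"]  # made-up data
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index)
print(toy_index)       # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'end': 5}
print(toy_vocab_size)  # 5
```

Because the largest index equals the vocabulary size, any layer that is indexed by word id needs `vocab_size + 1` slots, which is what the later cells use.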


In [31]:
# Convert text to integers using the tokenizer
int_sequences = tokenizer.texts_to_sequences(sentences)
int_sequences[0]

[19,
 16,
 8,
 9,
 1,
 16,
 3,
 63,
 14,
 175,
 10,
 604,
 4,
 1,
 16,
 3,
 107,
 14,
 990,
 15,
 9,
 57,
 1,
 81,
 3,
 140,
 6,
 82,
 225,
 7,
 605,
 2,
 606,
 98,
 63,
 8,
 44,
 57,
 991,
 4,
 1,
 63,
 992]

In [32]:
processed_sequences = []

for inp_sequence in int_sequences:
  # Start with the first two tokens, then grow the prefix one token at a time
  temp_list = inp_sequence[:2]
  processed_sequences.append(temp_list.copy())

  for item in inp_sequence[2:]:
    temp_list.append(item)
    processed_sequences.append(temp_list.copy())


In [33]:
processed_sequences[0], processed_sequences[1], processed_sequences[2]

([19, 16], [19, 16, 8], [19, 16, 8, 9])

In [34]:
# Extract features (X) and labels (Y) using list comprehensions
X = [sequence[:-1] for sequence in processed_sequences]  # Features (excluding the last item in each internal list)
y = [sequence[-1] for sequence in processed_sequences]    # Labels (only the last item in each internal list)


In [35]:
print(f"First 3 sequences: {processed_sequences[0], processed_sequences[1], processed_sequences[2]}")
print(f"Features list:  {X[0], X[1], X[2]}")
print(f"Labels list: {y[0], y[1], y[2]}")

First 3 sequences: ([19, 16], [19, 16, 8], [19, 16, 8, 9])
Features list:  ([19], [19, 16], [19, 16, 8])
Labels list: (16, 8, 9)
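The prefix expansion and the feature/label split can also be read as a single step: every prefix of length two or more contributes one training pair. A minimal sketch, assuming each sequence has at least two tokens (`make_prefix_pairs` is a hypothetical helper name):

```python
def make_prefix_pairs(sequence):
    # Hypothetical helper: one (features, label) pair per prefix of length >= 2
    pairs = []
    for end in range(2, len(sequence) + 1):
        prefix = sequence[:end]
        # Everything but the last token is the input; the last token is the label
        pairs.append((prefix[:-1], prefix[-1]))
    return pairs

pairs = make_prefix_pairs([19, 16, 8, 9])
print(pairs)
# [([19], 16), ([19, 16], 8), ([19, 16, 8], 9)]
```

A sequence of `n` tokens therefore yields `n - 1` training pairs, which matches the three pairs printed above for the first four tokens.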


In [36]:
# Find the length of the longest feature sequence; this is used as the padding length
max_length = max(len(internal_list) for internal_list in X)
max_length

552

In [37]:
from keras.preprocessing.sequence import pad_sequences

# Apply pre-padding to the feature sequences in X using the pad_sequences function
X = pad_sequences(X, maxlen=max_length, padding='pre')
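To make the effect of pre-padding concrete, here is a pure-Python stand-in for what `pad_sequences(..., padding='pre')` does to a single sequence with its default settings. `pre_pad` is a hypothetical helper written for illustration, not the Keras function.

```python
def pre_pad(sequence, maxlen, value=0):
    # Hypothetical stand-in for pad_sequences(..., padding='pre') on one sequence
    trimmed = sequence[-maxlen:]  # if too long, keep only the last maxlen tokens
    # Fill the left side with the padding value up to maxlen
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pre_pad([19], 5))                    # [0, 0, 0, 0, 19]
print(pre_pad([19, 16, 8], 5))             # [0, 0, 19, 16, 8]
print(pre_pad([1, 2, 3, 4, 5, 6, 7], 5))   # [3, 4, 5, 6, 7]
```

Padding on the left keeps the most recent words next to the end of the sequence, which is the position the LSTM reads last before predicting the next word.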

In [38]:
from keras.utils import to_categorical

y = to_categorical(y, num_classes = vocab_size + 1)

In [39]:
print(f"{X.shape, y.shape}")

((9136, 552), (9136, 2473))
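The label width of 2473 is `vocab_size + 1`: word indices run from 1 to 2472, and slot 0 is left for padding. A tiny one-hot sketch with made-up numbers shows why the extra slot is needed (`one_hot` is a hypothetical helper, not `to_categorical` itself):

```python
def one_hot(label, num_classes):
    # Hypothetical helper: a vector of zeros with a single 1.0 at position `label`
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

toy_vocab_size = 5   # made-up vocabulary with word indices 1..5
encoded = one_hot(5, toy_vocab_size + 1)
print(encoded)       # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

try:
    one_hot(5, toy_vocab_size)   # one slot too few for the largest index
    fits_without_extra_slot = True
except IndexError:
    fits_without_extra_slot = False
print(fits_without_extra_slot)   # False
```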


## Training an LSTM Model

In [40]:
from keras.models import Model
from keras.layers import Embedding, LSTM, Dense, Input


def get_model():
  # Define the input shape (sequence length)
  input_shape = (max_length,)

  # Input layer
  input_layer = Input(shape=input_shape)

  # Embedding layer
  embedding_size = 100  # Example embedding size; adjust for your use case
  embedding_layer = Embedding(input_dim=vocab_size + 1,
                              output_dim=embedding_size,
                              input_length=max_length)(input_layer)

  # LSTM layer
  lstm_1 = LSTM(500)(embedding_layer)

  # Output layer with softmax activation
  output_layer = Dense(vocab_size + 1,
                      activation='softmax',
                      name='output_layer')(lstm_1)

  # Create the model
  model = Model(inputs=input_layer, outputs=output_layer)

  # Compile the model with Adam optimizer, categorical cross-entropy loss, and accuracy metric
  model.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

  return model

model = get_model()
# Print the model summary
model.summary()


Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 552)]             0         
                                                                 
 embedding_1 (Embedding)     (None, 552, 100)          247300    
                                                                 
 lstm_1 (LSTM)               (None, 500)               1202000   
                                                                 
 output_layer (Dense)        (None, 2473)              1238973   
                                                                 
Total params: 2688273 (10.25 MB)
Trainable params: 2688273 (10.25 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
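The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the variable names are chosen here for illustration.

```python
vocab_plus_pad = 2472 + 1   # vocab_size + 1, as used in the model
embedding_dim = 100
lstm_units = 500

# Embedding: one vector of size embedding_dim per index
embedding_params = vocab_plus_pad * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output class
dense_params = lstm_units * vocab_plus_pad + vocab_plus_pad

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 247300 1202000 1238973 2688273
```

These values agree with the `Param #` column above. The output layer is the largest single block, because its size grows with the vocabulary.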


In [41]:
# Fit the model on the padded features X and the one-hot labels y
model.fit(X, y,
          epochs = 30
          )


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.src.callbacks.History at 0x78e73affe470>

In [42]:
# Import the necessary libraries
import numpy as np

# Define a function 'generate_text' that takes input text and the number of words to generate as parameters
def generate_text(input_text, num_gen):

  # Initialize the current input with the given input text
  current_input = input_text

  # Iterate 'num_gen' times to generate the specified number of words
  for i in range(num_gen):
    # Convert the current input text into tokenized form using the tokenizer
    tokenized_text = tokenizer.texts_to_sequences([current_input])[0]

    # Pad the tokenized text to match the required input length for the model
    padded_text = pad_sequences([tokenized_text],
                                maxlen=max_length,
                                padding='pre')

    # Use the model to predict the next word in the sequence
    prediction = model.predict(padded_text, verbose=0)

    # Get the index of the predicted word with the highest probability
    predicted_index = np.argmax(prediction)

    # Look up the word for the predicted index; index 0 (padding) has no word
    predicted_word = tokenizer.index_word.get(int(predicted_index))
    if predicted_word is None:
      break

    # Add the predicted word to the current input for the next iteration
    current_input = current_input + " " + predicted_word

  # Return the generated text
  return current_input


In [43]:
seed_text = "natural"  # avoid shadowing the built-in input()
words_to_generate = 20
output = generate_text(seed_text, words_to_generate)
output

'natural language processing nlp allows programs to read write and communicate in human languages such as english and go had simulate'
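The generation loop always picks the single most probable next word (greedy decoding). The toy sketch below isolates that loop. The score table is a made-up stand-in for the trained model, and it conditions only on the last word, whereas the LSTM above sees the whole padded prefix. `greedy_generate` and `toy_scores` are hypothetical names.

```python
def greedy_generate(seed_words, next_word_scores, num_gen):
    # next_word_scores: hypothetical stand-in for the model,
    # mapping the last word to {candidate_word: score}
    words = list(seed_words)
    for _ in range(num_gen):
        scores = next_word_scores.get(words[-1])
        if not scores:
            break  # no known continuation for this word
        # Greedy step: take the highest-scoring candidate
        words.append(max(scores, key=scores.get))
    return " ".join(words)

# Made-up scores for illustration only
toy_scores = {
    "natural": {"language": 0.9, "selection": 0.1},
    "language": {"processing": 0.7, "model": 0.3},
    "processing": {"natural": 0.6, "power": 0.4},
}
generated = greedy_generate(["natural"], toy_scores, 5)
print(generated)
# natural language processing natural language processing
```

Greedy decoding is deterministic, so the same seed always gives the same continuation, and it can fall into repetition as the toy output shows. That may be one reason the generated text above becomes less coherent toward the end.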