# Text Generation

## Overview

Text generation is a subfield of natural language processing (NLP). It draws on computational linguistics and artificial intelligence to automatically produce natural language text that meets a given communicative goal.
In its simplest form, a deep learning (DL) model is trained on a corpus and then generates new text that is random but, ideally, meaningful.

This notebook generates text using two approaches and models:
1. Transformers - Feeding seed text to a pre-trained transformer model for text generation (transfer learning).
2. Long short-term memory (LSTM) - Training an LSTM model on a custom dataset and generating text.

This notebook will include the following sections:
- Libraries and data importation.
- Data cleaning.
- Transformers text generation.
- LSTM text generation.


## 1. Import Libraries


Hugging Face Transformers provides a pool of pre-trained models for tasks across vision, text, and audio. Its APIs let us download and experiment with these models, and even fine-tune them on our own datasets.

In [1]:
## Download the transformers package from Hugging Face.
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-9g06dom_
  Running command git clone -q https://github.com/huggingface/transformers.git /tmp/pip-req-build-9g06dom_
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


In [2]:
## Import libraries
import random ## random number generation
import re ## regular expressions
from pickle import dump, load ## serializing and de-serializing objects

import numpy as np ## numerical processing
import pandas as pd ## data manipulation
import spacy ## advanced natural language processing

import keras ## deep learning framework
from keras.layers import Dense, LSTM, Embedding ## layers used by the model
from keras.models import Sequential, load_model ## building and loading models
from keras.preprocessing.sequence import pad_sequences ## ensure sequences in a list have the same length
from keras.preprocessing.text import Tokenizer ## map words to integer indices
from tensorflow.keras.utils import to_categorical ## converts a class vector (integers) to a binary class matrix

In [3]:
from transformers import pipeline ## Hugging Face pipeline API

## 2. Import Data

Data harvested from the Twitter API will be used to train the LSTM model. The data consists of trending tweets in Kenya.

In [1]:
## Mount google drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
## Import dataset from drive

twitter_data = pd.read_csv('/content/drive/MyDrive/NLP/Datasets/Twitter/Location Trend Tweets 2022-04-03.csv')

## Using 1 sample due to RAM limitations
# twitter_data1 = pd.read_csv('/content/drive/MyDrive/Module 3/Datasets/Location Trend Tweets 2022-03-31.csv')
# twitter_data2 = pd.read_csv('/content/drive/MyDrive/NLP/Datasets/Twitter/Location Trend Tweets 2022-04-01.csv')
# twitter_data3 = pd.read_csv('/content/drive/MyDrive/NLP/Datasets/Twitter/Location Trend Tweets 2022-04-02.csv')
# twitter_data4 = pd.read_csv('/content/drive/MyDrive/NLP/Datasets/Twitter/Location Trend Tweets 2022-04-03.csv')
# twitter_data5 = pd.read_csv('/content/drive/MyDrive/NLP/Datasets/Twitter/Location Trend Tweets 2022-04-04.csv')
# twitter_data6 = pd.read_csv('/content/drive/MyDrive/NLP/Datasets/Twitter/Location Trend Tweets 2022-04-05.csv')
#twitter_data = pd.concat([twitter_data6, twitter_data5,twitter_data4,twitter_data3,twitter_data2,twitter_data1])

twitter_data.head()

Unnamed: 0,screen_name,hashtag,tweet,time_stamp
0,Gringo42106495,Kondele,RT @ekisiangani: I abhor violence anywhere in ...,2022-04-03 16:01:48+00:00
1,mwabilimwagodi,Kondele,RT @kipmurkomen: When DP Ruto team was attacke...,2022-04-03 16:01:43+00:00
2,PeterRatemo4,Kondele,"RT @JKNjenga: For the fourth day running, Kond...",2022-04-03 16:01:41+00:00
3,DavidChirchir,Kondele,RT @NahashonKimemia: President Uhuru Kenyatta ...,2022-04-03 16:01:05+00:00
4,MabawaYaMbu,Kondele,RT @Silvia_Wangeci: Even After ALL The Violenc...,2022-04-03 16:01:01+00:00


## 3. Clean Data

A custom function cleans the harvested tweets.

In [7]:
def text_cleaner (text):
  """ Function to clean text data. 

  Parameters
  ----------
  text : A string
    
  Returns
  -------
  text : Cleaned string.
    
  """

  text = re.sub(r'@[A-Za-z0-9]+','',str(text)) ## remove @ mentions (letters and digits only)
  text = re.sub(r'#','',str(text)) ## remove # symbol
  text = re.sub(r'^RT+','',str(text)) ## remove the leading RT (retweet marker)
  text = re.sub(r'https?:\/\/\S+','',str(text)) ## remove hyperlinks
  text = re.sub(r'[^\w\s]','',str(text)) ## remove everything apart from word characters and whitespace
  text = re.sub(r'_',' ',str(text)) ## replace underscores with spaces
  text = re.sub(r'\n',' ',str(text)) ## replace newlines with spaces

  return text
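
The mention pattern above matches only letters and digits, so a handle such as `@Silvia_Wangeci` leaves `Wangeci` behind in the cleaned text. The following is a minimal sketch of a variant that handles underscores in handles; the function name `clean_tweet` and the exact patterns are assumptions for illustration, not part of the original pipeline.

```python
import re

def clean_tweet(text):
    """Hypothetical variant of text_cleaner; name and patterns are illustrative assumptions."""
    text = str(text)
    text = re.sub(r'https?://\S+', '', text)   # drop hyperlinks while their punctuation is intact
    text = re.sub(r'^RT\s+', '', text)         # drop a leading retweet marker
    text = re.sub(r'@\w+:?', '', text)         # drop @mentions, including underscores in handles
    text = re.sub(r'[^\w\s]|_', ' ', text)     # turn punctuation and underscores into spaces
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

print(clean_tweet("RT @Silvia_Wangeci: Even After ALL The Violence"))
print(clean_tweet("Check #Kondele https://t.co/abc now!"))
```

Replacing punctuation with a space rather than deleting it keeps neighbouring words from fusing together, at the cost of splitting contractions.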

In [8]:
## Create clean text column

twitter_data['cleaned_tweet'] = twitter_data['tweet'].apply(text_cleaner)

## Select necessary columns
twitter_data = twitter_data[['screen_name','hashtag','tweet','cleaned_tweet','time_stamp']]
twitter_data.head()

Unnamed: 0,screen_name,hashtag,tweet,cleaned_tweet,time_stamp
0,Gringo42106495,Kondele,RT @ekisiangani: I abhor violence anywhere in ...,I abhor violence anywhere in the world But l...,2022-04-03 16:01:48+00:00
1,mwabilimwagodi,Kondele,RT @kipmurkomen: When DP Ruto team was attacke...,When DP Ruto team was attacked in Kibera Uhu...,2022-04-03 16:01:43+00:00
2,PeterRatemo4,Kondele,"RT @JKNjenga: For the fourth day running, Kond...",For the fourth day running Kondele has remai...,2022-04-03 16:01:41+00:00
3,DavidChirchir,Kondele,RT @NahashonKimemia: President Uhuru Kenyatta ...,President Uhuru Kenyatta has condemned the a...,2022-04-03 16:01:05+00:00
4,MabawaYaMbu,Kondele,RT @Silvia_Wangeci: Even After ALL The Violenc...,Wangeci Even After ALL The Violence That Was...,2022-04-03 16:01:01+00:00


## 4. Text Generation

To generate text we will employ two approaches:

1. Apply transfer learning: use a pre-trained model from Hugging Face.
2. Train an LSTM model on the tweets to generate text.

### 4.1 Transformers Text Generation

A transformer is a deep learning model built on self-attention, a mechanism that weights each part of the input differently according to its relevance.
We will use the pre-trained GPT-Neo 1.3B model from Hugging Face for this task. GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B is the number of parameters of this particular pre-trained model.
GPT-Neo 1.3B was trained on the Pile, a large-scale curated dataset created by EleutherAI for the purpose of training this model. Many of the Pile's component datasets come from academic or professional sources.
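
To make the idea of self-attention concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the raw input vectors (no learned projections), and the toy vectors are invented for illustration. GPT-Neo itself uses many heads with learned weights.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with identity projections (a simplification)."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors))
                        for j in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

# Toy embeddings: the first two vectors are similar, the third is different
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

The first position places more weight on the similar second vector than on the dissimilar third one, which is the "differential weighting" described above.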

In [9]:
%%time
## Download the GPT-Neo model from Hugging Face.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')

CPU times: user 20.1 s, sys: 7.87 s, total: 27.9 s
Wall time: 47.3 s


The transformer model requires seed text and a minimum length for the generated text. Data exploration showed a lot of tweets about the Kenyan election and constitutional reforms, so the model will be fed seed text on this subject.

In [10]:
%%time
## Feed seed text to the transformer model
text = generator("Kenya constitution ammendmens", do_sample=True, min_length=500)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


CPU times: user 20.8 s, sys: 140 ms, total: 21 s
Wall time: 21 s


In [11]:
## Print text
text
print(text[0]['generated_text'])

Kenya constitution ammendmens

The Kenyan constitution ammended in 2015 to provide for the establishment of a Constitutional Assembly that is charged with the task of drafting a new constitution. The constitution ammendments were initiated during the Kenyan elections


The model makes a decent attempt at spinning a short article on this topic. Since the model was trained largely on academic and professional text, such results were expected.

### 4.2 LSTM Text Generation

A recurrent neural network (RNN) is a sequential network that carries a hidden state from one step to the next, giving it a form of persistent memory. A long short-term memory (LSTM) network is an advanced RNN whose gated cell state lets information persist over longer spans and mitigates the vanishing gradient problem that plain RNNs face.

The LSTM will be trained on sequences of words to predict the most probable next word.
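
As an illustration of how an LSTM lets information persist, here is a minimal sketch of one step of a single-unit LSTM cell in plain Python. The parameter layout and the hand-picked gate values are assumptions chosen for illustration; the Keras `LSTM` layer used later learns vectors and matrices instead of scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name to (input weight, recurrent weight, bias)."""
    def gate(name, activation):
        w, u, b = p[name]
        return activation(w * x + u * h_prev + b)

    forget = gate('forget', sigmoid)          # how much of the old cell state to keep
    write = gate('input', sigmoid)            # how much new information to store
    expose = gate('output', sigmoid)          # how much of the cell state to reveal
    candidate = gate('candidate', math.tanh)  # proposed new content
    c = forget * c_prev + write * candidate   # additive cell-state update
    h = expose * math.tanh(c)                 # hidden state passed to the next step
    return h, c

# Hand-picked parameters: forget gate nearly open, input gate nearly closed
params = {
    'forget': (0.0, 0.0, 10.0),
    'input': (0.0, 0.0, -10.0),
    'output': (0.0, 0.0, 0.0),
    'candidate': (0.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.5] * 50:
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))
```

With the forget gate close to 1, the cell state survives 50 steps almost unchanged. That additive path is what helps gradients flow across long sequences.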

### 4.2.1 Text Pre-Processing

In [12]:
## Load the spaCy English model (only the tokenizer is needed)
nlp = spacy.load('en_core_web_sm',disable=['parser','tagger','ner'])

In [13]:
## Remove custom punctuation
def separate_punc (doc_text):
  """ Function to remove custom punctuation and tokenize text. 

  Parameters
  ----------
  doc_text : A text document
    
  Returns
  -------
  tokens : List of lowercased tokens with custom punctuation removed.
    
  """

  return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
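
Note that `token.text not in '...'` is a substring test against one long string, so whether a token is dropped depends on whether it appears verbatim inside that string. A minimal sketch of a set-based alternative follows; the helper name `keep_token` and the toy token list are assumptions for illustration, and the tokens here are plain strings rather than spaCy tokens.

```python
import string

PUNCTUATION = set(string.punctuation)

def keep_token(token):
    """Keep a token unless it is empty, whitespace-only, or made up entirely of punctuation."""
    stripped = token.strip()
    return bool(stripped) and not all(ch in PUNCTUATION for ch in stripped)

# Toy stand-in for the token texts produced by the spaCy tokenizer
raw_tokens = ['  ', 'I', 'abhor', '--', 'violence', '\n\n', '!']
cleaned = [t.lower() for t in raw_tokens if keep_token(t)]
print(cleaned)
```

Unlike the substring test, this check also discards whitespace-only tokens of any length.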

In [14]:
## Concatenate tweets to form a document
d = twitter_data['cleaned_tweet'].str.cat()
d

'  I abhor violence anywhere in the world But let us deal with the problem fairly and without discrimination If violence is  When DP Ruto team was attacked in Kibera Uhuru said its ok for politicians to be stonedWhen DP was attacked in KenolampEmb  For the fourth day running Kondele has remained the top Twitter trend in Kenya  This has not happened since Russia invaded  President Uhuru Kenyatta has condemned the attack on Former PM Raila Odingas chopper Alisema nini kuhusu Kondele Th  Wangeci Even After ALL The Violence That Was Meted Against Him in Kondele William Ruto Has Apologized to Raila Odinga on behal  When DP Ruto team was attacked in Kibera Uhuru said its ok for politicians to be stonedWhen DP was attacked in KenolampEmb  When DP Ruto team was attacked in Kibera Uhuru said its ok for politicians to be stonedWhen DP was attacked in KenolampEmb  When DP Ruto team was attacked in Kibera Uhuru said its ok for politicians to be stonedWhen DP was attacked in KenolampEmbStream Ameri

In [15]:
## Raise spaCy's max_length to the document length so the whole text can be processed
nlp.max_length = len(d)

In [16]:
## Tokenize text
tokens = separate_punc(d)

## Glimpse first few tokens
print(len(tokens))
tokens[:10]

42329


['  ',
 'i',
 'abhor',
 'violence',
 'anywhere',
 'in',
 'the',
 'world',
 'but',
 'let']

### 4.2.2 Create Sequences

In [17]:
# organize into sequences of tokens
train_len = 25+1 # 25 training words , then one target word

# Empty list of sequences
text_sequences = []

## Slide a window of train_len tokens over the text, one token at a time
for i in range(train_len, len(tokens)):
    
    # Grab train_len tokens ending just before position i
    seq = tokens[i-train_len:i]
    
    # Add to list of sequences
    text_sequences.append(seq)

In [18]:
## First sequence
' '.join(text_sequences[0])

'   i abhor violence anywhere in the world but let us deal with the problem fairly and without discrimination if violence is when dp ruto team'

In [19]:
## Second sequence
' '.join(text_sequences[1])

'i abhor violence anywhere in the world but let us deal with the problem fairly and without discrimination if violence is when dp ruto team was'

In [20]:
## Third sequence
' '.join(text_sequences[2])

'abhor violence anywhere in the world but let us deal with the problem fairly and without discrimination if violence is when dp ruto team was attacked'
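
The sliding-window construction used for these sequences can be sketched on a toy sentence. This is a minimal illustration, assuming a window of 4 tokens (3 inputs plus 1 target) instead of 26; the helper name `make_windows` is hypothetical, and unlike the loop above it also keeps the final window.

```python
def make_windows(tokens, window):
    """Return every run of `window` consecutive tokens, shifted one token at a time."""
    return [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

toy_tokens = "i abhor violence anywhere in the world".split()
windows = make_windows(toy_tokens, 4)

# The last token of each window is the target; the rest are the inputs
for w in windows:
    print(w[:-1], '->', w[-1])
```

Each window overlaps the previous one by all but one token, which is why consecutive sequences in the notebook differ only by a single shifted word.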

### 4.2.3 Keras Tokenization

To train the text generator we need to convert the text data into a format digestible by the model. Text encoding transforms words into numbers and sequences of words into sequences of numbers. We first fit a tokenizer on the text and then convert the text to integer sequences.
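
As a rough illustration of what integer encoding does, the sketch below builds a word index in plain Python. It mimics the Keras `Tokenizer` convention of giving the most frequent word index 1 and leaving index 0 unused; the helper names and the toy sequences are assumptions, and the real `Tokenizer` also applies its own filtering and lowercasing.

```python
from collections import Counter

def fit_word_index(sequences):
    """Map each word to an integer, most frequent first, starting at 1 (0 stays reserved)."""
    counts = Counter(word for seq in sequences for word in seq)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sequences, word_index):
    """Replace every word with its integer index."""
    return [[word_index[word] for word in seq] for seq in sequences]

toy_sequences = [
    ['the', 'team', 'was'],
    ['team', 'was', 'attacked'],
    ['the', 'team', 'won'],
]
word_index = fit_word_index(toy_sequences)
encoded = encode(toy_sequences, word_index)
print(word_index)
print(encoded)
```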

In [21]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)

In [22]:
## View first sequence
for i in sequences[0]:
    print(f'{i} : {tokenizer.index_word[i]}')

3 :   
16 : i
772 : abhor
111 : violence
697 : anywhere
4 : in
1 : the
347 : world
56 : but
268 : let
43 : us
602 : deal
24 : with
1 : the
557 : problem
771 : fairly
5 : and
346 : without
769 : discrimination
67 : if
111 : violence
8 : is
55 : when
49 : dp
34 : ruto
112 : team


In [23]:
## Get the text vocabulary size
vocabulary_size = len(tokenizer.word_counts)
vocabulary_size

5377

In [24]:
## Convert sequences to numpy array

sequences = np.array(sequences)

In [25]:
## View resulting array
sequences


array([[   3,   16,  772, ...,   49,   34,  112],
       [  16,  772,  111, ...,   34,  112,   10],
       [ 772,  111,  697, ...,  112,   10,   89],
       ...,
       [  82,   69, 2668, ..., 2671,    2,  603],
       [  69, 2668, 2669, ...,    2,  603,  269],
       [2668, 2669,   18, ...,  603,  269,   82]])

### 4.2.4 Create LSTM Model

In [26]:
## Split X and y features

## All elements in sequence but last element
X = sequences[:,:-1]

## Last element in sequence
y = sequences[:,-1]

y = to_categorical(y, num_classes=vocabulary_size+1)

seq_len = X.shape[1]

seq_len

25
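
The `+1` in `num_classes=vocabulary_size+1` is needed because the tokenizer's word indices start at 1, so the largest index equals the vocabulary size and the one-hot vector needs one extra slot for the unused index 0. A minimal sketch with a toy vocabulary of 5 words (the helper `one_hot` and the toy data are assumptions for illustration):

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_vocabulary_size = 5                    # word indices run from 1 to 5
toy_sequences = [[2, 1, 3], [1, 3, 4], [3, 4, 5]]

toy_X = [seq[:-1] for seq in toy_sequences]  # all tokens but the last
toy_y = [one_hot(seq[-1], toy_vocabulary_size + 1) for seq in toy_sequences]  # last token, one-hot

print(toy_X)
print(toy_y)
```

With only `toy_vocabulary_size` classes, the word with index 5 would fall outside the vector.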

In [27]:
# define model
def create_model(vocabulary_size, seq_len):
  
  """ Function to define and compile LSTM model. 

  Parameters
  ----------
  vocabulary_size : Number of output classes (unique words in the corpus plus one for the reserved index 0)

  seq_len : Number of words to use in sequence
    
  Returns
  -------
  
  model: defined and compiled model
  """
  
  model = Sequential()
  model.add(Embedding(vocabulary_size, 25, input_length=seq_len))
  model.add(LSTM(150, return_sequences=True))
  model.add(LSTM(150))
  model.add(Dense(150, activation='relu'))

  model.add(Dense(vocabulary_size, activation='softmax'))
    
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
   
  model.summary()
    
  return model


In [28]:
## Summary of model
model = create_model(vocabulary_size+1, seq_len)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 25)            134450    
                                                                 
 lstm (LSTM)                 (None, 25, 150)           105600    
                                                                 
 lstm_1 (LSTM)               (None, 150)               180600    
                                                                 
 dense (Dense)               (None, 150)               22650     
                                                                 
 dense_1 (Dense)             (None, 5378)              812078    
                                                                 
Total params: 1,255,378
Trainable params: 1,255,378
Non-trainable params: 0
_________________________________________________________________
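
The parameter counts in the summary can be checked by hand with the standard formulas for each layer type. The sketch below reproduces them; the helper function names are hypothetical, while the sizes (5378 classes, embedding dimension 25, 150 units) are taken from the summary above.

```python
def embedding_params(vocab, dim):
    return vocab * dim                                  # one vector per word index

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units                    # weights plus biases

vocab, embed_dim, units = 5378, 25, 150
layer_counts = {
    'embedding': embedding_params(vocab, embed_dim),
    'lstm': lstm_params(embed_dim, units),
    'lstm_1': lstm_params(units, units),
    'dense': dense_params(units, units),
    'dense_1': dense_params(units, vocab),
}
total_params = sum(layer_counts.values())
print(layer_counts)
print(total_params)
```

Most of the parameters sit in the final softmax layer, because it has one output per word in the vocabulary.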


In [29]:
# fit model
model.fit(X, y, batch_size=256, epochs=100,verbose=1)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x7f54ea13a9d0>

In [30]:
# save the model to file
model.save('/content/drive/MyDrive/Module 3/Outputs/epochBIG.h5')
# save the tokenizer
dump(tokenizer, open('/content/drive/MyDrive/Module 3/Outputs/epochBIG', 'wb'))

### 4.2.5 Generate Text

In [31]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
  ''' Parameters
    ----------

  model : model that was trained on text data
  tokenizer : tokenizer that was fit on text data
  seq_len : length of training sequence
  seed_text : raw string text to serve as the seed
  num_gen_words : number of words to be generated by model

  Returns
  -------
  
  generated text: generated text of specified length


  '''
    
  # Final Output
  output_text = []
    
  # Initial seed sequence
  input_text = seed_text
    
  # Create num_gen_words
  for i in range(num_gen_words):
        
      # Take the input text string and encode it to a sequence
      encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
      # Pad or truncate to the sequence length the model was trained on
      pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
      # Predict class probabilities and take the index of the most probable word
      pred_word_ind = np.argmax(model.predict(pad_encoded), axis=-1)[0]

        
      # Grab word
      pred_word = tokenizer.index_word[pred_word_ind] 
        
      # Update the sequence of input text (shifting one over with the new word)
      input_text += ' ' + pred_word
        
      output_text.append(pred_word)
        
  # Make it look like a sentence.
  return ' '.join(output_text)
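
The loop in `generate_text` is a greedy, autoregressive decoder: predict the most probable next word, append it to the context, and repeat. The sketch below shows the same control flow with a toy stand-in for the trained model; the bigram table, `toy_model`, and `greedy_generate` are all hypothetical and exist only to make the loop runnable without Keras.

```python
def greedy_generate(next_word_probs, seed_words, seq_len, num_gen_words):
    """Repeatedly pick the most probable next word and feed it back as context."""
    context = list(seed_words)
    output = []
    for _ in range(num_gen_words):
        window = context[-seq_len:]          # keep the last seq_len words, like truncating='pre'
        probs = next_word_probs(window)      # word -> probability
        word = max(probs, key=probs.get)     # greedy choice, the analogue of np.argmax
        context.append(word)
        output.append(word)
    return ' '.join(output)

# Toy stand-in for model.predict: the next word depends only on the last word
bigram = {
    'ruto': {'team': 0.7, 'has': 0.3},
    'team': {'was': 0.9, 'is': 0.1},
    'was': {'attacked': 0.6, 'ruto': 0.4},
    'attacked': {'ruto': 1.0},
}

def toy_model(window):
    return bigram[window[-1]]

print(greedy_generate(toy_model, ['dp', 'ruto'], seq_len=2, num_gen_words=4))
```

Because the choice is always the single most probable word, the same seed always yields the same continuation.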

In [32]:
## Randomly select 1 sequence
random.seed(101)
random_pick = random.randint(0,len(text_sequences))
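
Note that `random.randint(a, b)` includes both endpoints, so an upper bound of `len(text_sequences)` can, rarely, yield an index one past the end of the list. A minimal sketch of a bounded pick using `random.randrange`, with a toy list standing in for `text_sequences`:

```python
import random

toy_text_sequences = [['a', 'b'], ['b', 'c'], ['c', 'd']]  # toy stand-in for text_sequences

random.seed(101)
pick = random.randrange(len(toy_text_sequences))  # upper bound is exclusive, so the index is always valid
seed_sequence = toy_text_sequences[pick]
print(pick, seed_sequence)
```

Switching functions changes which index a given seed produces, so the selected sequence would differ from the one shown in this notebook.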

In [33]:
## View selected sequence
random_seed_text = text_sequences[random_pick]
random_seed_text

['you',
 'against',
 'other',
 'communities',
 'desist',
 'from',
 'overreactingm',
 'those',
 'are',
 'kalenjins',
 'killing',
 'each',
 'other',
 'because',
 'of',
 'there',
 'selfish',
 'interest',
 'not',
 'about',
 'high',
 'cost',
 'of',
 'living',
 'wacha',
 'upuzi']

In [34]:
## Join seed text
seed_text = ' '.join(random_seed_text)
seed_text

'you against other communities desist from overreactingm those are kalenjins killing each other because of there selfish interest not about high cost of living wacha upuzi'

In [35]:
## Load saved model
model = load_model('/content/drive/MyDrive/Module 3/Outputs/epochBIG.h5')

In [36]:
## Generate next 10 words from sequence
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=10)

'brian dear kalenjinsbe very carefulsomebody is working hard to turn'

After 100 epochs of training, the LSTM model performs commendably. It attains roughly 83% accuracy on the training data (no validation split was used), and it predicts a sequence of mostly sensible words as output.

## Conclusion

- Both the transformer model and the LSTM model produced promising results.
- Given more time, an interesting next step is to fine-tune a transformer model on the custom data.
- Training the LSTM model for more epochs may further improve its output.