## Headline Generation

This starter kernel trains a language model on a dataset of a million real news headlines, then uses that model to generate new headlines.

* Language Model: Learn to predict the next word given some context.
* Text Generation: Repeatedly sample from the model until the end-of-headline token is produced.

## Data Processing

First we will read data from [A Million News Headlines](https://www.kaggle.com/therohk/million-headlines).

In [None]:
import pandas as pd

million = pd.read_csv('../input/million-headlines/abcnews-date-text.csv', delimiter=',', nrows=200000)
data = million.drop(['publish_date'], axis=1).rename(columns={'headline_text': 'headline'})

We now need to convert the textual data into a format appropriate for training (i.e., a vector of integer word indices).

First, though, we need to take a small step back and think about the generation process as a whole. We want to generate text word after word until... when? We could call it quits after 10 words, but that is not desirable: headlines vary in length, and stopping after a fixed number of words can cut the text off abruptly. To avoid these problems, we let the model decide when to stop. We add a special end token to each training item, and the model learns to predict it like any other word. We also add a special start token to each item, to mark the beginning of the headline.

These special characters are simply '÷' and '■' (start/end respectively).

In [None]:
START = '÷'
END = '■'

After adding the start and end tokens, we tokenize the text using the preprocessing utilities built into `keras`. Next, we need to format the tokenized data for training. When generating text, we are given a sequence of words and want to predict the next one in line, so we create `(context, next_word)` pairs. In our case `context` is a variable-length sequence of words (fixed-size n-grams would be an alternative), and `next_word` is the word the model has to predict from it.

In [None]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def format_data(data, max_features, maxlen, shuffle=False):
    if shuffle:
        data = data.sample(frac=1).reset_index(drop=True)
    
    # Add start and end tokens. A new Series is built so that the caller's
    # DataFrame is not modified and repeated calls do not stack tokens.
    text = START + ' ' + data['headline'].str.lower() + ' ' + END
    
    # Tokenize text
    filters = "!\"#$%&()*+,-./;<=>?@[\\]^_`{|}~\t\n"
    tokenizer = Tokenizer(num_words=max_features, filters=filters)
    tokenizer.fit_on_texts(list(text))
    corpus = tokenizer.texts_to_sequences(text)
    
    # Build training sequences of (context, next_word) pairs.
    # Context sequences have variable length; an alternative to this
    # approach is to parse the training data into n-grams.
    # The first context is the start token alone, which is exactly the
    # input that generation begins with.
    X, Y = [], []
    for line in corpus:
        for i in range(1, len(line)):
            X.append(line[:i])
            Y.append(line[i])
    
    # Pad X and convert Y to categorical (Y consisted of integers)
    X = pad_sequences(X, maxlen=maxlen)
    Y = to_categorical(Y, num_classes=max_features)

    return X, Y, tokenizer
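
To make the pairing concrete, here is a minimal sketch on a single hand-made token list. The token ids and the helper names `make_pairs` and `left_pad` are made up for illustration and are not part of the kernel. In this sketch every prefix of the headline, including the one that holds only the start token, is used as a context, and `left_pad` imitates the default left ("pre") padding and truncation of `pad_sequences`.

```python
# Toy illustration of (context, next_word) pairs; all ids are made up.
START_ID, END_ID = 1, 2                 # hypothetical ids of the start/end tokens
line = [START_ID, 17, 42, 8, END_ID]    # one tokenized headline


def make_pairs(line):
    """Return a (context, next_word) pair for every prefix of the line."""
    return [(line[:i], line[i]) for i in range(1, len(line))]


def left_pad(seq, maxlen, value=0):
    """Keep the last `maxlen` items and pad with `value` on the left."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


pairs = make_pairs(line)
for context, next_word in pairs:
    print(left_pad(context, maxlen=6), '->', next_word)
```

Each headline of `k` tokens therefore yields `k - 1` training examples, and the last example always has the end token as its target.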

Let's tokenize our data!

In [None]:
max_features, max_len = 3500, 20
X, Y, tokenizer = format_data(data, max_features, max_len)

The tokenizer is basically a large dictionary of `{word: index}` items. We are going to print the indices of an arbitrary word and of the two special tokens (to make sure they are included in the tokenizer).

In [None]:
tokenizer.word_index['trump'], tokenizer.word_index[START], tokenizer.word_index[END]

## Model Building

Now we are ready to build our model. As usual, we first pass the input through an embedding layer. A bidirectional LSTM layer then operates on these embeddings and produces the input for the Dense classifier. The output of this final Dense layer is a probability distribution over all words in the vocabulary.

In [None]:
from keras.layers import Dense, Bidirectional, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential

epochs = 3

model = Sequential()

# Embedding and bidirectional LSTM
model.add(Embedding(max_features, 300))
model.add(SpatialDropout1D(0.33))
model.add(Bidirectional(LSTM(30)))

# Output layer
model.add(Dense(max_features, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=epochs, batch_size=128, verbose=1)

model.save_weights('model{}.h5'.format(epochs))
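
As a rough sanity check on the size of this network, the trainable parameters can be counted by hand. The sketch below assumes the standard Keras LSTM parameterization (four gates, each with an input kernel, a recurrent kernel and a bias) and the default `concat` merge mode of `Bidirectional`; the helper functions are hypothetical, and `model.summary()` remains the authoritative count.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


vocab_size, embed_dim, lstm_units = 3500, 300, 30

n_embedding = embedding_params(vocab_size, embed_dim)
n_bilstm = 2 * lstm_params(embed_dim, lstm_units)     # forward + backward LSTM
n_dense = dense_params(2 * lstm_units, vocab_size)    # input is the concatenation
total = n_embedding + n_bilstm + n_dense

print(n_embedding, n_bilstm, n_dense, total)
```

Under these assumptions most of the parameters sit in the embedding layer, so the vocabulary size (`max_features`) is the main lever on model size.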

In [None]:
model.evaluate(X, Y)

## Text Generation

With the model at hand, we are ready to start generating text. We are going to feed the start token, '÷', to the model and then repeatedly sample a word and feed the text generated so far back in, until we reach the end token, '■'.

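In our sampling function, we adjust the softmax probabilities by a temperature: the log-probabilities are divided by `temp` and then renormalized, so values below 1 sharpen the distribution and values above 1 flatten it.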

In [None]:
def sample(preds, temp=1.0):
    """
    Sample next word given softmax probabilities, using temperature.
    
    Taken and modified from:
    https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py
    """
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temp
    preds = np.exp(preds) / np.sum(np.exp(preds))
    probs = np.random.multinomial(1, preds, 1)
    return np.argmax(probs)
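
To get a feel for what the temperature does, the sketch below applies the same rescaling to a small, made-up distribution using only the standard library. The function name `apply_temperature` and the probabilities are illustrative assumptions, not part of the kernel.

```python
import math


def apply_temperature(probs, temp):
    """Divide log-probabilities by temp and renormalize (p ** (1 / temp))."""
    scaled = [math.exp(math.log(p) / temp) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.6, 0.3, 0.1]  # made-up next-word probabilities

sharp = apply_temperature(probs, 0.25)   # low temperature: close to argmax
same = apply_temperature(probs, 1.0)     # temperature 1: distribution unchanged
flat = apply_temperature(probs, 2.0)     # high temperature: closer to uniform

for name, dist in (('0.25', sharp), ('1.0', same), ('2.0', flat)):
    print(name, [round(p, 3) for p in dist])
```

With a low temperature almost all of the probability mass moves to the most likely word, which is why very small values of `temp` make generation nearly deterministic.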

In the generation function, we are given an input text. We tokenize this input and pad it so that it is in the correct form. Then, we feed the tokenized input into the model to compute the probability distribution over next words. We sample from this distribution and append the selected word to the generated sentence. Then, we feed the whole generated sentence back to the network and generate the next word (alternatively, we can feed only the last part of the generated sentence). We repeat this process until we generate the end token ('■').

In [None]:
"""When sampling from the distribution, we do not know which word is being
sampled, only its index. We need a way to go from index to word. Unfortunately,
the tokenizer class only contains a dictionary of {word: index} items. We will
reverse that dictionary to get {index: word} items. That way, going from
indices to words is much faster."""
idx_to_words = {value: key for key, value in tokenizer.word_index.items()}


def process_input(text):
    """Tokenize and pad input text"""
    tokenized_input = tokenizer.texts_to_sequences([text])[0]
    # Pad to the same length that was used for the training contexts
    return pad_sequences([tokenized_input], maxlen=max_len)


def generate_text(input_text, model, n=7, temp=1.0):
    """Takes some input text and feeds it to the model (after processing it).
    Then, samples a next word and feeds it back into the model until the end
    token is produced.
    
    :input_text: A string or list of strings to be used as a generation seed.
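    :model:      The model to be used during generation.
    :n:          Number of trailing words to keep as context. Currently unused;
                 see the commented-out alternative at the end of the loop.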
    :temp:       A float that adjusts how 'volatile' predictions are. A higher
                 value increases the chance of low-probability predictions to
                 be picked."""
    if isinstance(input_text, str):
        sent = input_text
    else:
        sent = ' '.join(input_text)
    
    # Process the joined string so that list seeds are handled correctly
    tokenized_input = process_input(sent)
    
    while True:
        preds = model.predict(tokenized_input, verbose=0)[0]
        pred_idx = sample(preds, temp=temp)
        pred_word = idx_to_words[pred_idx]
        
        if pred_word == END:
            return sent
        
        sent += ' ' + pred_word
        # Alternative: feed only the last n words back into the model
        # tokenized_input = process_input(' '.join(sent.split()[-n:]))
        tokenized_input = process_input(sent)
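
The sample-and-feed-back loop can also be tried without a trained network. The sketch below is a toy under stated assumptions: `TOY_MODEL` is a hypothetical hand-written table that stands in for the model and looks only at the last word, and the `max_words` cap is an extra safeguard that the function above does not have.

```python
import random

START, END = '÷', '■'

# Hypothetical stand-in for the trained model: next-word probabilities
# that depend only on the last word generated so far.
TOY_MODEL = {
    START: {'police': 0.5, 'man': 0.5},
    'police': {'probe': 0.7, END: 0.3},
    'man': {'charged': 0.8, END: 0.2},
    'probe': {'crash': 0.6, END: 0.4},
    'charged': {END: 1.0},
    'crash': {END: 1.0},
}


def toy_generate(seed=START, max_words=20, rng=random):
    """Sample words until END is drawn or max_words words were generated."""
    sent = [seed]
    for _ in range(max_words):
        dist = TOY_MODEL[sent[-1]]
        words, weights = zip(*dist.items())
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == END:
            break
        sent.append(word)
    return ' '.join(sent[1:])  # drop the start token


print(toy_generate())
```

A cap such as `max_words` may be worth considering for the real loop as well, since a poorly trained model is not guaranteed to ever sample the end token.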

We can now start generating text, using the starting symbol as input.

In [None]:
text = generate_text(START, model, temp=0.01)
text[2:] # the first two elements are '÷ '

In [None]:
text = generate_text(START, model, temp=0.25)
text[2:] # the first two elements are '÷ '

In [None]:
text = generate_text(START, model, temp=0.5)
text[2:] # the first two elements are '÷ '

In [None]:
text = generate_text(START, model, temp=0.75)
text[2:] # the first two elements are '÷ '

In [None]:
text = generate_text(START, model, temp=1.0)
text[2:] # the first two elements are '÷ '