# Generating Headlines in English Using Transformers

In today’s fast-paced digital landscape, the ability to create engaging and relevant headlines is essential for capturing audience attention and driving interaction. Headlines are often the first encounter readers have with content, and their effectiveness can significantly influence whether an article is read further. With the increasing volume of content and the need for timely updates, automating headline generation has become increasingly valuable.

This project addresses this challenge with the **Transformer architecture**, a neural network design for processing sequences. Unlike recurrent models, which read text one token at a time, **Transformers use self-attention to relate positions across the whole sequence directly, which helps them handle long-range dependencies and capture intricate relationships within the data**. This lets the model generate coherent and contextually relevant headlines that reflect the nuances of the input text.

## Why Transformers for Headline Generation?

Earlier sequence models for text generation, such as recurrent neural networks, often struggle to capture and maintain long-range dependencies within the text, which can result in headlines that lack coherence or relevance. Transformers, with their self-attention mechanism, are designed to address this issue. Unlike those earlier models, **Transformers can analyze and integrate information from different parts of the sequence simultaneously**, allowing them to generate headlines that are not only grammatically sound but also contextually relevant, even over long sequences of text.

## Project Goals

The aim of this project is to develop a **Transformer-based** model capable of generating high-quality, engaging headlines in English. By training the model on a comprehensive dataset of existing headlines, we seek to produce headlines that are both accurate and creatively crafted. The Transformer’s advanced architecture will enable the generation of headlines that are contextually rich and impactful. This model is intended to support content creators, journalists, and marketers by offering a powerful tool to quickly generate captivating headlines, ultimately boosting productivity and enhancing content engagement.

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import numpy as np
import os
import string
import random

The `causal_attention_mask` function creates a mask that ensures each word in a sequence can only be influenced by previous words and itself, not by future words. This is crucial for text generation models, as it prevents the model from using future information to make predictions, ensuring that the generated text is coherent and correctly ordered.

In [2]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Creates a causal attention mask to ensure that each token in a sequence only attends to previous and current tokens,
    but not to future tokens. This is crucial for autoregressive models where each token should not be influenced
    by tokens that come after it in the sequence.

    The mask is designed to be applied to the attention weights in self-attention mechanisms, such as those used
    in Transformer models. It prevents information from flowing from future tokens to the current token, ensuring
    that predictions for each token depend only on tokens that precede it.

    Parameters:
    - batch_size (int): The number of sequences in the batch.
    - n_dest (int): The length of the destination sequence (number of tokens in the sequence being processed).
    - n_src (int): The length of the source sequence (typically equal to n_dest in self-attention).
    - dtype (tf.DType): The data type for the mask tensor (e.g., tf.float32, tf.int32).

    Returns:
    - tf.Tensor: A tensor of shape [batch_size, n_dest, n_src] where the upper triangle of the dot product matrix
      is masked out with zeros, and the lower triangle (including the diagonal) is filled with ones. This tensor
      can be used to mask the attention weights in a self-attention mechanism, ensuring that each token attends only
      to earlier tokens and itself, but not to future tokens.

    Example:
    >>> causal_mask = causal_attention_mask(2, 4, 4, tf.float32)
    >>> print(causal_mask)
    <tf.Tensor: shape=(2, 4, 4), dtype=float32, numpy=
    array([[[1., 0., 0., 0.],
            [1., 1., 0., 0.],
            [1., 1., 1., 0.],
            [1., 1., 1., 1.]],

           [[1., 0., 0., 0.],
            [1., 1., 0., 0.],
            [1., 1., 1., 0.],
            [1., 1., 1., 1.]]], dtype=float32)>
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)
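
To make the comparison `i >= j - n_src + n_dest` easier to inspect, the following is a minimal pure-Python sketch of the same rule, without TensorFlow or the batch dimension. The helper name `causal_mask_rows` is hypothetical and is not part of the notebook.

```python
def causal_mask_rows(n_dest, n_src):
    """Return the mask as nested lists: entry (i, j) is 1 when
    destination position i may attend to source position j."""
    return [
        [1 if i >= j - n_src + n_dest else 0 for j in range(n_src)]
        for i in range(n_dest)
    ]


# Square case (self-attention): a lower-triangular matrix
square = causal_mask_rows(4, 4)
for row in square:
    print(row)

# Non-square case: with more source than destination positions,
# the triangle is shifted to the right
wide = causal_mask_rows(2, 4)
for row in wide:
    print(row)
```

In the square case used by this notebook, row `i` has ones in columns `0..i`, so a token never sees positions that come after it.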

The `TransformerBlock` class is the basic building block of Transformer models for sequence data such as text. To understand or generate language, a model has to process a sequence of tokens and work out how they relate to one another. Each `TransformerBlock` does this by combining causally masked self-attention with a small feed-forward network.

In [3]:
class TransformerBlock(layers.Layer):
    """
    A single block of the Transformer model architecture. This block combines multi-head self-attention
    and feed-forward neural networks to process input sequences.

    The TransformerBlock is designed to capture complex dependencies in sequential data by using self-attention
    mechanisms. It also includes feed-forward layers to further process the attention outputs, along with normalization
    and dropout layers to stabilize training and prevent overfitting.

    Attributes:
    - embed_dim (int): The dimension of the embedding space.
    - num_heads (int): The number of attention heads in the multi-head attention mechanism.
    - ff_dim (int): The dimension of the feed-forward network hidden layer.
    - rate (float): The dropout rate applied to the attention and feed-forward layers (default is 0.1).

    Methods:
    - call(inputs): Executes the forward pass of the Transformer block. It applies the multi-head attention, adds
      residual connections, normalizes the outputs, and processes them through a feed-forward network.

    Parameters:
    - inputs (tf.Tensor): Input tensor with shape (batch_size, seq_len, embed_dim). Represents the sequence of embeddings.

    Returns:
    - tf.Tensor: Output tensor with shape (batch_size, seq_len, embed_dim). The processed sequence after attention,
      feed-forward operations, and normalization.

    Example:
    >>> transformer_block = TransformerBlock(embed_dim=64, num_heads=4, ff_dim=128)
    >>> inputs = tf.random.uniform((32, 10, 64))  # Example input tensor with batch_size=32, seq_len=10, embed_dim=64
    >>> output = transformer_block(inputs)
    >>> print(output.shape)
    (32, 10, 64)
    """
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)


At its core, the `TransformerBlock` is designed to help the model focus on different parts of a sequence when making decisions. For example, if you’re translating a sentence, the model needs to understand which words in the sentence are related to each other, even if they are far apart. The TransformerBlock achieves this using two **key mechanisms: attention and feed-forward processing**, supported by normalization and dropout.

* **Self-Attention Mechanism:**
Think of this as the block's way of looking at the words in a sentence and figuring out which ones deserve more attention. It does this by calculating how strongly each word relates to the other words in the sequence. Because this block applies a causal mask, each word is compared only with itself and the words before it. For instance, in the sentence "The cat sat on the mat," the **TransformerBlock** helps the model relate "mat" back to "cat," even though they are not next to each other.

* **Feed-Forward Network:**
After understanding these relationships, the next step is to further process this **information to make it more useful**. This is where the feed-forward network comes in. **It takes the attention outputs and applies additional transformations to refine the information**. **This step helps in capturing complex patterns and making the final output more precise.**

* **Normalization and Dropout:**
To **ensure the model learns effectively and doesn’t overfit to the training data**, the TransformerBlock includes normalization and dropout layers. Layer normalization stabilizes training by rescaling each token's activations to a standard scale. Dropout randomly sets a fraction of the activations to zero during training (10% by default here) to **prevent the model from becoming too dependent on any specific feature.** Dropout is inactive at inference time.

In essence, the `TransformerBlock` acts as a smart processor within a larger machine learning model. It helps the model focus on important parts of the input sequence, processes this information in a sophisticated way, and ensures that the learning process is stable and robust. This makes it an essential component in creating models that can understand and generate human-like text, perform translations, or even respond intelligently in conversations.
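
The attention and "add and normalize" steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Keras implementation: it uses a single head, takes queries, keys, and values directly from the input instead of learned projections, and omits dropout, the feed-forward network, and the learned scale and offset of `LayerNormalization`. The helper names (`softmax`, `causal_self_attention`, `layer_norm`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head attention where queries = keys = values = x.
    Position i only attends to positions 0..i (the causal mask)."""
    dim = len(x[0])
    outputs, all_weights = [], []
    for i, query in enumerate(x):
        visible = x[: i + 1]  # future positions are excluded
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


# Three toy token vectors of dimension 4
x = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 2.0, 0.0, 1.0]]
attended, attn_weights = causal_self_attention(x)

# Residual connection followed by layer normalization
block_out = [layer_norm([a + b for a, b in zip(xi, ai)]) for xi, ai in zip(x, attended)]

for i, weights in enumerate(attn_weights):
    print(f"position {i} attends over {len(weights)} position(s)")
```

The first position can only attend to itself, so its attention weight is 1; later positions spread their weight over everything seen so far.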

The `TokenAndPositionEmbedding` class is designed to convert sequences of tokens into meaningful numerical representations for a machine learning model. **This class addresses two key aspects:**

* **Token Representation:** It translates each token in the input sequence into a **dense vector**, known as a token embedding. **This vector captures the semantic information of the token**, allowing the model to interpret its meaning.

* **Positional Information:** To capture the order of tokens within the sequence, it generates **positional embeddings**. These are learned vectors, one for each absolute position (0, 1, 2, and so on up to `maxlen - 1`), so the model can take into account where each token appears in the sequence.

By combining **token embeddings with positional embeddings, the class provides a comprehensive representation of each token that includes both its meaning and its position in the sequence**. This combined representation is crucial for models to process and understand sequences accurately, enhancing their ability to perform tasks that depend on the order and context of the tokens.

In [4]:
class TokenAndPositionEmbedding(layers.Layer):
    """
    A custom layer that combines token embeddings and positional embeddings for sequences.
    This layer is designed to convert input tokens into dense vectors and add positional information
    to each token embedding to capture the order of tokens in a sequence.

    The TokenAndPositionEmbedding layer is crucial for models that process sequential data, such as
    natural language processing models, where understanding the position of each token in the sequence
    is essential for interpreting the context and meaning.

    Attributes:
    - maxlen (int): The maximum length of the input sequences. This determines the size of the positional
      embeddings.
    - vocab_size (int): The size of the vocabulary, which determines the number of possible tokens.
    - embed_dim (int): The dimensionality of the embedding space. Each token and position is mapped to a vector of
      this size.

    Methods:
    - call(x): Applies the token and positional embeddings to the input sequences. It generates embeddings for each
      token and adds positional embeddings to these token embeddings to encode the order of tokens in the sequence.

    Parameters:
    - x (tf.Tensor): Input tensor of shape (batch_size, sequence_length), where each value represents a token index
      in the input sequences.

    Returns:
    - tf.Tensor: Output tensor of shape (batch_size, sequence_length, embed_dim), where each token index in the input
      sequences has been converted into an embedding vector, with positional information added to it.

    Example:
    >>> embedding_layer = TokenAndPositionEmbedding(maxlen=100, vocab_size=5000, embed_dim=64)
    >>> input_seq = tf.constant([[1, 5, 9], [2, 6, 3]])
    >>> output = embedding_layer(input_seq)
    >>> print(output.shape)
    (2, 3, 64)
    """
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
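
As a rough illustration of what this layer computes, the sketch below performs the same lookup-and-add in plain Python. The tables are filled with small random numbers purely for demonstration (in the real layer they are trained weights), and the names `make_table`, `token_table`, `position_table`, and `embed` are hypothetical.

```python
import random


def make_table(rows, dim, seed):
    """Create a lookup table of small random vectors (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(rows)]


toy_vocab_size, toy_maxlen, toy_embed_dim = 10, 6, 4
token_table = make_table(toy_vocab_size, toy_embed_dim, seed=0)
position_table = make_table(toy_maxlen, toy_embed_dim, seed=1)


def embed(token_ids):
    """Look up each token vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


# Token 3 appears twice, at positions 0 and 2
vectors = embed([3, 7, 3])
print(len(vectors), len(vectors[0]))
print(vectors[0] != vectors[2])
```

The same token gets a different final vector at each position, which is how the model can tell identical words apart by where they occur.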


## Data Reading and Preprocessing

### Data Loading

In [5]:
batch_size = 128

# The dataset is a single text file;
# TextLineDataset yields one line of the file per example

file = "/content/dataset.txt"

# Create a dataset from text files
text_ds = tf.data.TextLineDataset(file)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)

The `custom_standardization` function cleans the raw text so that it is uniformly formatted before it is vectorized and fed into the model.



In [6]:
def custom_standardization(input_string):
    """Lowercase, replace HTML line-break tags with spaces, and put a space before punctuation"""
    lowercased = tf.strings.lower(input_string)
    stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"([{string.punctuation}])", r" \1")

Here’s how the function does this:

* **Lowercasing:** First, the function converts all characters in the text to lowercase. This step is crucial because it ensures that the text is uniform—"Hello" and "hello" will be treated as the same word. **This simplification helps the model handle text more effectively without being thrown off by differences in capitalization.**

* **Removing HTML Line-Break Tags:** Next, the function targets HTML line-break tags (`<br />`), which often appear in web-scraped text or HTML content. These tags are meant to indicate a new line, but they don’t carry meaningful content for text analysis. The function replaces these tags with spaces to ensure that line breaks don’t disrupt the flow of the text.

* **Handling Punctuation:** Finally, the function deals with punctuation marks. Punctuation can create issues in text processing if it stays attached to words, because "market" and "market." would otherwise become different tokens. The function inserts a space before every punctuation mark, so each mark is split off from the preceding word and treated as its own token.

In essence, `custom_standardization` prepares raw text by making it consistently formatted and free of unnecessary HTML tags and improperly handled punctuation. This step is fundamental in ensuring that the text is clean and uniform, setting the stage for more effective and accurate analysis or modeling.
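
The three steps can be tried on an ordinary Python string with the standard `re` module. This is a sketch of equivalent behavior, not the TensorFlow function itself: the name `standardize` is hypothetical, and `re.escape` is used here as an extra precaution for building the character class.

```python
import re
import string

# Character class matching any single ASCII punctuation mark
PUNCT_PATTERN = re.compile(f"([{re.escape(string.punctuation)}])")


def standardize(text):
    """Lowercase, replace <br /> with a space, and put a space before punctuation."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    return PUNCT_PATTERN.sub(r" \1", no_breaks)


cleaned = standardize("Hello, World!<br />Bye.")
print(cleaned)  # hello , world ! bye .
```

After this step, splitting on whitespace yields words and punctuation marks as separate tokens.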

In [7]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 80  # Max sequence size

# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)

In preparing text data for a machine learning model, this code snippet is setting up a crucial tool for **transforming raw text into a format that the model can use.**

First, it defines the **scope of the vocabulary**. With `vocab_size = 20000` and `max_tokens=vocab_size - 1`, the layer keeps at most 19,999 entries, counting the reserved padding and out-of-vocabulary tokens, and fills the rest with the most frequent words. This keeps the vocabulary manageable and filters out rare terms that might clutter the data. It also sets `maxlen = 80`, the number of tokens in each input sequence the model sees.

To turn the text into a numerical format, the code creates a `TextVectorization` layer. This layer first cleans the text with the `custom_standardization` function, which lowercases it, replaces HTML line-break tags with spaces, and separates punctuation from words. It then splits the text into tokens and maps each token to an integer index, with more frequent tokens receiving smaller indices.

The output sequences are one token longer than `maxlen` (81 tokens rather than 80) because each sequence is later split into an input and a label that are offset by one position, which leaves 80 tokens in each. Shorter texts are padded with zeros and longer ones are truncated.
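
The idea behind integer vectorization can be sketched in plain Python. This is a simplified stand-in rather than the Keras layer: the function names `build_vocab` and `vectorize` are hypothetical, tokens are split on whitespace only, and the way ties between equally frequent words are ordered is not meant to match Keras. Index 0 is reserved for padding and index 1 for out-of-vocabulary words, as in the layer's `int` output mode.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Keep the most frequent words, after two reserved entries."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, seq_len):
    """Map words to indices, then truncate or zero-pad to seq_len."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


corpus = ["rates rise again", "rates fall", "markets rise as rates fall"]
toy_vocab = build_vocab(corpus, max_tokens=5)
toy_ids = vectorize("rates soar again", toy_vocab, seq_len=6)

print(toy_vocab)
print(toy_ids)  # [2, 1, 1, 0, 0, 0]
```

Here "rates" is the most frequent word and gets index 2, while "soar" (never seen) and "again" (too rare to make the cut) both map to the out-of-vocabulary index 1.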

In [8]:
# Preparation of the dictionary/vocabulary
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices


After setting up the text vectorization layer, the next step is to prepare the vocabulary that the model will use.

This snippet begins by adapting the **vectorize_layer** to the text data. This process involves analyzing the text dataset to build a vocabulary based on the most frequent words. Essentially, the layer learns which words are present and how often they occur.

Once this adaptation is complete, the code retrieves the vocabulary using **vectorize_layer.get_vocabulary()**. This vocabulary is a list where each word is associated with a unique index. This step is crucial because it allows you to convert these indices back into words when interpreting the model's output or debugging the results.
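
Converting indices back into words is a plain list lookup. The sketch below uses a small made-up vocabulary (`toy_vocab`) in place of the real one returned by `get_vocabulary()`, and the helper name `decode` is hypothetical.

```python
# Made-up vocabulary: index 0 is padding, index 1 is the out-of-vocabulary token
toy_vocab = ["", "[UNK]", "the", "market", "rallies", "today"]


def decode(token_ids, vocab):
    """Turn token indices back into text, skipping padding (index 0)."""
    return " ".join(vocab[i] for i in token_ids if i != 0)


decoded = decode([2, 3, 4, 1, 0, 0], toy_vocab)
print(decoded)  # the market rallies [UNK]
```

The same lookup is what makes generated token indices readable once the model starts producing output.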

In [9]:
def prepare_lm_inputs_labels(text):
    """
    Shift word sequences by 1 position so that the target for position (i) is
    word at position (i+1). The model will use all words up till position (i)
    to predict the next word.
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

The `prepare_lm_inputs_labels` function is designed to prepare text data for training a language model by creating input-output pairs that reflect how the model will learn to predict the next word in a sequence.

**Here’s how it works:**

* **Expanding Dimensions:** The function starts by adding an extra trailing dimension to the text data, turning a batch of strings of shape `(batch_size,)` into shape `(batch_size, 1)`, so that each example is passed to the vectorization layer as a single string.

* **Tokenization:** Next, the text is converted into numerical indices using the vectorize_layer. This process transforms each word in the text into its corresponding token ID, creating a sequence of integers.

* **Creating Input-Output Pairs:** The function then prepares two key components:
  * **Inputs (x):** It takes all tokens except the last one in each sequence. These tokens serve as the input for the model.
  * **Labels (y):** It takes all tokens except the first one, so the label at position i is the token at position i + 1 of the original sequence. These tokens act as the target output the model should learn to predict.

In essence, this function arranges the text data into sequences where the model learns to predict the next word based on the words it has seen so far. This setup is fundamental for training a language model to generate coherent and contextually appropriate text.
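
The one-position offset is easiest to see on a short list of token indices. The sketch below uses made-up token IDs and a hypothetical helper `make_inputs_labels`; it mirrors the slicing in `prepare_lm_inputs_labels` for a single sequence, without the vectorization step.

```python
def make_inputs_labels(token_ids):
    """Split one token sequence into inputs and next-token labels."""
    x = token_ids[:-1]  # everything except the last token
    y = token_ids[1:]   # everything except the first token
    return x, y


tokens = [12, 7, 45, 3, 0, 0]  # made-up IDs; trailing zeros are padding
x, y = make_inputs_labels(tokens)

print(x)  # [12, 7, 45, 3, 0]
print(y)  # [7, 45, 3, 0, 0]

# With causal masking, position i sees x[0..i] and is trained to predict y[i]
for i in range(3):
    print(x[: i + 1], "->", y[i])
```

Both lists have one element fewer than the original sequence, which is why the vectorization layer produces `maxlen + 1` tokens.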

In [None]:
text_ds = text_ds.map(prepare_lm_inputs_labels)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)