# Generating Headlines in English Using LSTM

In today’s fast-paced digital world, the ability to create compelling and relevant headlines is crucial for capturing audience attention and driving engagement. Headlines serve as the first impression of content, influencing readers' decisions to explore articles further. Given the growing volume of content and the demand for timely information, automating the headline creation process presents a significant opportunity.

This project addresses that need by automating headline generation with machine learning. The approach is built on **Long Short-Term Memory (LSTM) networks**, a type of Recurrent Neural Network (RNN) known for handling sequences and long-term dependencies in data.



## Why LSTM for Headline Generation?

Simpler approaches to text generation, such as fixed-window n-gram models or plain RNNs, often lose track of context over longer sequences, which can produce headlines that drift off topic or read poorly. LSTMs add gated memory cells that let the network retain and reuse information from earlier parts of the sequence. This makes them a good fit for generating text that is both fluent and contextually appropriate.

## Project Goals

The objective is to develop an **LSTM-based model** that can generate high-quality, engaging headlines in English. By training the model on a diverse dataset of existing headlines, we aim to produce headlines that are not only accurate but also creative and relevant. This model has the potential to assist content creators, journalists, and marketers by providing them with a tool to quickly generate impactful headlines, thereby enhancing productivity and content engagement.

## 1. Reading the Dataset

In [1]:
with open("/content/dataset.txt", encoding="latin-1") as f:
    dataset = f.read().splitlines()

In [2]:
dataset[:10]

['New energy law promises to revolutionize the electric sector',
 'Climate change continues to be a global threat',
 'Investors seek opportunities in renewable energy',
 'Demand for electric vehicles increases',
 'COVID-19 vaccines: When will we all be protected?',
 'The debate over vaccines continues to divide opinions',
 'Health experts analyze the effectiveness of vaccines',
 'Mass vaccination against coronavirus underway',
 'Cryptocurrency market soars to new heights',
 'Is Bitcoin the currency of the future?']

## 2. Data Preparation

### 2.1 Data Cleaning

In [3]:
import string
import unicodedata

def clean_and_normalize_text(txt):
    # Remove punctuation and convert to lowercase
    txt = "".join(c for c in txt if c not in string.punctuation).lower()
    # Decompose accented characters (NFKD), then drop everything that is not ASCII
    txt = unicodedata.normalize('NFKD', txt).encode('ascii', 'ignore').decode('ascii')
    return txt

In [4]:
dataset = [clean_and_normalize_text(headline) for headline in dataset]

In [5]:
dataset[:10]

['new energy law promises to revolutionize the electric sector',
 'climate change continues to be a global threat',
 'investors seek opportunities in renewable energy',
 'demand for electric vehicles increases',
 'covid19 vaccines when will we all be protected',
 'the debate over vaccines continues to divide opinions',
 'health experts analyze the effectiveness of vaccines',
 'mass vaccination against coronavirus underway',
 'cryptocurrency market soars to new heights',
 'is bitcoin the currency of the future']

The function `clean_and_normalize_text` cleans and standardizes each headline before tokenization. It does two things:

**Remove Unwanted Characters:**

* **Objective:** Eliminate punctuation marks from the text.
* **Why:** Punctuation carries little meaning for this task. Removing it keeps the vocabulary focused on the words themselves. One side effect is visible in the output above: "COVID-19" becomes "covid19".

**Standardize Text:**

* **Objective:** Convert the text to lowercase and reduce it to plain ASCII.
* **Why:** Lowercasing ensures that "Hello" and "hello" are treated as the same word. The NFKD normalization step splits an accented letter into a base letter plus a combining mark, and the ASCII encoding then drops the mark, so "é" becomes "e". Characters with no ASCII equivalent are dropped entirely. This keeps text from different sources consistent and the vocabulary small.

The result is a cleaner, more uniform set of headlines, so the model does not have to learn separate entries for variants of the same word that differ only in case, punctuation, or accents.
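To make the two steps concrete, here is a small sketch that runs the same cleaning logic on three inputs. The first headline comes from the dataset; the other two are hypothetical examples made up for illustration, chosen to show how accented letters and non-ASCII quotation marks are handled.

```python
import string
import unicodedata

def clean_and_normalize_text(txt):
    # Same logic as the notebook's function, repeated so this sketch runs on its own
    txt = "".join(c for c in txt if c not in string.punctuation).lower()
    return unicodedata.normalize("NFKD", txt).encode("ascii", "ignore").decode("ascii")

samples = [
    "COVID-19 vaccines: When will we all be protected?",  # from the dataset
    "Café prices rise in São Paulo",                      # hypothetical: accented letters
    "\u2018Green\u2019 bonds gain ground",                # hypothetical: curly quotes
]

for headline in samples:
    print(clean_and_normalize_text(headline))

# covid19 vaccines when will we all be protected
# cafe prices rise in sao paulo
# green bonds gain ground
```

Note that the curly quotes are not part of `string.punctuation`, so they survive the first step and are only removed by the ASCII encoding step.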

### 2.2 Data Tokenization

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

In [7]:
def generate_token_sequences(tokenizer, dataset):
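    # Fit the tokenizer: build the word-to-index vocabulary from the dataset
    tokenizer.fit_on_texts(dataset)
    # +1 because Keras word indices start at 1; index 0 is reserved
    total_words = len(tokenizer.word_index) + 1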

    # Tokenize the text in the dataset
    dataset_tokens = []
    for text in dataset:
        text_tokens = tokenizer.texts_to_sequences([text])[0]
        # Collect every prefix of the headline that has at least two tokens
        for i in range(1, len(text_tokens)):
            n_gram = text_tokens[:i+1]
            dataset_tokens.append(n_gram)

    return dataset_tokens, total_words


In [8]:
dataset_tokens, total_words = generate_token_sequences(tokenizer, dataset)

The function `generate_token_sequences` turns the cleaned headlines into training sequences. It first fits the tokenizer on the dataset, which assigns each word an integer index, and sets `total_words` to the vocabulary size plus one because index 0 is not assigned to any word. It then converts each headline into a list of integers and collects every prefix of that list with at least two tokens (the code calls these n-grams). A headline of *n* tokens therefore contributes *n* - 1 sequences, each of which can later serve as a context plus the next word to predict.
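The prefix construction is easier to see on a tiny example. The sketch below does not use the Keras `Tokenizer`; `build_word_index` and `prefix_sequences` are hypothetical helpers written for this illustration, and the word index is a simplified stand-in that numbers words from 1 by frequency. The exact indices Keras assigns on the full dataset will differ.

```python
from collections import Counter

def build_word_index(texts):
    # Simplified stand-in for a fitted tokenizer: the most frequent word gets
    # index 1, and index 0 is left unassigned
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def prefix_sequences(token_ids):
    # Every prefix with at least two tokens, mirroring the inner loop of
    # generate_token_sequences
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

headlines = [
    "demand for electric vehicles increases",
    "investors seek opportunities in renewable energy",
]

word_index = build_word_index(headlines)
total_words = len(word_index) + 1  # +1 accounts for the unassigned index 0

token_ids = [word_index[word] for word in headlines[0].split()]
for sequence in prefix_sequences(token_ids):
    print(sequence)

# [1, 2]
# [1, 2, 3]
# [1, 2, 3, 4]
# [1, 2, 3, 4, 5]
```

The five-word headline yields four sequences. A one-word headline would yield none, since there is no context to predict from.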