### Import Libraries
This cell imports the necessary libraries, including PyTorch for deep learning and NLTK for tokenizing text. It also sets a random seed to ensure reproducibility of results.


In [1]:
# Import Necessary Libraries

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from rouge import Rouge
import os
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading punkt_tab: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

### BiLSTM Model for Text Summarization

The `BiLSTMSummarizer` class implements a Bidirectional LSTM-based model designed for text summarization. Below is a detailed explanation of its components and the overall forward pass process:

1. **Embedding Layer**:
    - This layer converts input tokens, represented as indices, into dense vector representations (embeddings) of size `embedding_dim`. It serves as the initial transformation of words into numerical form that can be processed by the model.

2. **Bidirectional LSTM Encoder**:
    - The encoder consists of a bidirectional LSTM (`nn.LSTM`), which processes the input sequence in both forward and backward directions. This allows the model to capture context from both sides of the input sentence, improving its understanding of long-range dependencies. The LSTM’s hidden size is specified by `hidden_dim`.

3. **LSTM Decoder**:
    - The decoder is a standard LSTM that receives the concatenated hidden states from both directions of the encoder. It generates the summary sequence one token at a time, using either teacher forcing (feeding in the actual next token) or the previously predicted token.

4. **Fully Connected Output Layer**:
    - After the decoder processes the sequence, the output is passed through a fully connected layer, which maps the hidden states to a probability distribution over the vocabulary. This distribution is used to predict the next word in the summary.

#### Forward Pass Workflow:

- **Inputs**:
    - `src`: The input article or text sequence.
    - `trg`: The target sequence, representing the summary.
    - `teacher_forcing_ratio`: A parameter that determines how often teacher forcing is applied. During training, it controls the probability of using the correct next token (from `trg`) instead of the predicted token.

- **Steps**:
    1. The source sequence (`src`) is embedded into dense vectors using the embedding layer.
    2. These embedded vectors are processed by the bidirectional LSTM encoder, which generates hidden states for both forward and backward passes.
    3. The hidden states from the two directions are concatenated.
    4. The decoder LSTM generates the target sequence (summary) token by token. Depending on the value of `teacher_forcing_ratio`, the decoder either receives the true next token or its own prediction as input at each step.

- **Output**:
    - The model outputs a sequence of predicted tokens representing the generated summary, with each token being the most likely word based on the decoder’s output probabilities.

In [2]:
# Define the BiLSTM model for text summarization
class BiLSTMSummarizer(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(BiLSTMSummarizer, self).__init__()
        # Embedding layer to convert input words to word embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM encoder with bidirectionality to capture context from both directions
        self.encoder = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)

        # Decoder with LSTM, input is the output from the encoder (concatenated hidden states)
        self.decoder = nn.LSTM(embedding_dim, hidden_dim * 2, batch_first=True)

        # Fully connected layer to map the hidden states to the vocabulary size (output)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    # Forward pass through the model
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]  # Get batch size
        trg_len = trg.shape[1]  # Get the length of the target sequence
        trg_vocab_size = self.fc.out_features  # Get the output vocabulary size

        # Initialize the output tensor with zeros (batch_size, trg_len, vocab_size)
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(src.device)

        # Pass the source sentence through the embedding layer
        embedded = self.embedding(src)

        # Pass the embeddings through the bidirectional LSTM encoder
        enc_output, (hidden, cell) = self.encoder(embedded)

        # Combine the hidden states from both directions (concatenate)
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1).unsqueeze(0)
        cell = torch.cat((cell[-2,:,:], cell[-1,:,:]), dim=1).unsqueeze(0)

        # Start decoding with the first token (usually <sos>)
        input = trg[:, 0]

        # Loop over each time step in the target sequence
        for t in range(1, trg_len):
            input_embedded = self.embedding(input).unsqueeze(1)  # Embed the current input token
            output, (hidden, cell) = self.decoder(input_embedded, (hidden, cell))  # Decode one step
            prediction = self.fc(output.squeeze(1))  # Pass decoder output through fully connected layer

            outputs[:, t] = prediction  # Store the prediction at current time step

            # Use teacher forcing (feeding correct output token back into the model)
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = prediction.argmax(1)  # Get the predicted token
            input = trg[:, t] if teacher_force else top1  # Decide whether to use teacher forcing or not

        return outputs

### Dataset Class for Text Summarization

The `SummarizationDataset` class is a custom PyTorch `Dataset` that prepares data (articles and their corresponding summaries) for the text summarization task. Here's a breakdown of its functionality:

1. **Initialization (`__init__`)**:
    - **Parameters**:
        - `articles`: A list containing the source texts (input articles).
        - `summaries`: A list containing the target texts (summaries).
        - `vocab`: A dictionary mapping words to their corresponding indices (vocabulary).
        - `max_length`: The maximum length for input and output sequences, used for padding or truncating.

    - **Purpose**:
        - It initializes the dataset with articles, summaries, and the vocabulary while setting a maximum sequence length for consistent data processing.

2. **Dataset Length (`__len__`)**:
    - **Purpose**:
        - This method returns the number of samples in the dataset by returning the length of the `articles` list.

3. **Fetching Data Sample (`__getitem__`)**:
    - **Purpose**:
        - This method retrieves one sample (a pair of an article and its summary) from the dataset, converts the text into a sequence of token indices, and ensures the sequence length is consistent by padding or truncating the sequences.
    
    - **Process**:
        - Each article and summary is tokenized by converting the text into a list of indices based on the provided vocabulary (`vocab`).
        - Special tokens like `<sos>` (start of sentence) and `<eos>` (end of sentence) are added at the beginning and end of each sequence.
        - If any words are not found in the vocabulary, they are replaced with the `<unk>` (unknown) token.
        - Padding (`<pad>`) is added to sequences that are shorter than the maximum length (`max_length`), ensuring all sequences are of the same length for batch processing.

    - **Output**:
        - The method returns two tensors: one for the article and one for the corresponding summary, both padded or truncated to the same length (`max_length`).

#### Special Tokens:
- **`<sos>`**: Marks the start of a sentence.
- **`<eos>`**: Marks the end of a sentence.
- **`<pad>`**: Used to pad sequences to ensure consistent length across the dataset.
- **`<unk>`**: Represents unknown or out-of-vocabulary words.


In [3]:
# Dataset class to prepare data for summarization
class SummarizationDataset(Dataset):
    def __init__(self, articles, summaries, vocab, max_length=100):
        self.articles = articles  # List of articles (source text)
        self.summaries = summaries  # List of summaries (target text)
        self.vocab = vocab  # Vocabulary mapping
        self.max_length = max_length  # Maximum sequence length for padding/truncating

    # Return the length of the dataset
    def __len__(self):
        return len(self.articles)

    # Return a sample of data (article, summary) as tensors
    def __getitem__(self, idx):
        article = self.articles[idx]
        summary = self.summaries[idx]

        # Convert article to a list of token indices
        article_indices = [self.vocab['<sos>']] + [self.vocab.get(token, self.vocab['<unk>']) for token in article][:self.max_length-2] + [self.vocab['<eos>']]
        summary_indices = [self.vocab['<sos>']] + [self.vocab.get(token, self.vocab['<unk>']) for token in summary][:self.max_length-2] + [self.vocab['<eos>']]

        # Pad sequences to max_length
        article_indices = article_indices + [self.vocab['<pad>']] * (self.max_length - len(article_indices))
        summary_indices = summary_indices + [self.vocab['<pad>']] * (self.max_length - len(summary_indices))

        return torch.tensor(article_indices), torch.tensor(summary_indices)

### Data Loading and Preprocessing

1. **Load Data (`load_data`)**:
    - Loads headlines and contents from a CSV file.
    - **Input**: CSV file path.
    - **Output**: Lists of headlines and content.
    
2. **Tokenize Text (`tokenize`)**:
    - Tokenizes and lowercases the text using NLTK.
    - **Input**: Text string.
    - **Output**: Tokenized words.

3. **Build Vocabulary (`build_vocab`)**:
    - Creates a word-to-index vocabulary based on word frequency.
    - **Input**: Tokenized texts, minimum frequency (`min_freq=2`).
    - **Output**: `word2idx` and `idx2word` mappings, including special tokens (`<pad>`, `<unk>`, `<sos>`, `<eos>`).


In [4]:
# Load the dataset from a CSV file
file_path = r"D:\hindi_news_dataset.csv"
def load_data(file_path):
    df = pd.read_csv(file_path)
    return df['Headline'].tolist(), df['Content'].tolist() 

# Tokenize text using word_tokenize from nltk
def tokenize(text):
    return word_tokenize(text.lower())  # Tokenize and lowercase the text

# Build vocabulary from the dataset
def build_vocab(texts, min_freq=2):
    word_freq = Counter()  # Count word frequencies
    for text in texts:
        word_freq.update(text)

    # Initialize special tokens
    vocab = {'<pad>': 0, '<unk>': 1, '<sos>': 2, '<eos>': 3}

    # Add words with frequency >= min_freq
    for word, freq in word_freq.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)

    return vocab, {v: k for k, v in vocab.items()}  # Return word2idx and idx2word mappings


### Data Preparation for Text Summarization

1. **Load and Tokenize Data**:
    - The articles and summaries are loaded from the CSV file and tokenized into word tokens.
    - **Steps**:
        - `articles, summaries = load_data(file_path)` loads the data.
        - `tokenized_articles = [tokenize(article) for article in articles]` tokenizes each article.
        - `tokenized_summaries = [tokenize(summary) for summary in summaries]` tokenizes each summary.

2. **Build Vocabulary**:
    - The vocabulary is built from the tokenized articles and summaries.
    - **Step**:
        - `vocab, inv_vocab = build_vocab(tokenized_articles + tokenized_summaries)` creates word-to-index and index-to-word mappings.

3. **Split Data**:
    - The tokenized data is split into training, validation, and test sets.
    - **Steps**:
        - `train_test_split` is used to split the data, first into training and test sets (80-20 split), then further splitting the training set to create a validation set (10% of training).


In [5]:
# Load and tokenize the articles and summaries
articles, summaries = load_data(file_path)
tokenized_articles = [tokenize(article) for article in articles]
tokenized_summaries = [tokenize(summary) for summary in summaries]

# Build vocabulary
vocab, inv_vocab = build_vocab(tokenized_articles + tokenized_summaries)

# Split the data into training, validation, and test sets
train_articles, test_articles, train_summaries, test_summaries = train_test_split(tokenized_articles, tokenized_summaries, test_size=0.2, random_state=42)
train_articles, val_articles, train_summaries, val_summaries = train_test_split(train_articles, train_summaries, test_size=0.1, random_state=42)


In [6]:
# Create datasets using the tokenized data and vocab
train_dataset = SummarizationDataset(train_articles, train_summaries, vocab, max_length=50)
val_dataset = SummarizationDataset(val_articles, val_summaries, vocab, max_length=50)
test_dataset = SummarizationDataset(test_articles, test_summaries, vocab, max_length=50)

# Create data loaders to feed data in batches
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)
test_loader = DataLoader(test_dataset, batch_size=64)

In [7]:
# Initialize model and hyperparameters
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Use GPU if available

vocab_size = len(vocab)   # Size of the vocabulary
embedding_dim = 300       # Size of word embeddings
hidden_dim = 512          # Size of LSTM hidden state
output_dim = vocab_size   # Output size, generally the size of the vocabulary

# Initialize the BiLSTM model and move it to the device (GPU/CPU)
model = BiLSTMSummarizer(vocab_size, embedding_dim, hidden_dim, output_dim).to(device)

### Training Function

The `train` function performs the training process, updating model weights using batches of data.

1. **Training Mode**:
    - `model.train()` enables training-specific behaviors (like dropout).

2. **Batch Processing**:
    - Each batch (`src`, `trg`) is passed through the model, and outputs are reshaped for loss calculation.

3. **Loss and Backpropagation**:
    - Loss is calculated between predicted and target sequences, followed by backpropagation.

4. **Gradient Clipping**:
    - Gradients are clipped to prevent explosion (`clip=1`).

5. **Optimizer Step**:
    - Weights are updated using the optimizer.

6. **Return**:
    - Average loss per epoch.



In [8]:
# Training function
def train(model, iterator, optimizer, criterion, device, clip=1, teacher_forcing_ratio=0.5):
    model.train()  # Set model to training mode
    epoch_loss = 0
    for batch in tqdm(iterator, desc="Training"):  # Iterate over batches
        src, trg = batch
        src, trg = src.to(device), trg.to(device)

        optimizer.zero_grad()  # Clear gradients
        output = model(src, trg, teacher_forcing_ratio)  # Forward pass

        output_dim = output.shape[-1]
        output = output[:, 1:].reshape(-1, output_dim)  # Reshape output for loss calculation
        trg = trg[:, 1:].reshape(-1)  # Flatten target sequence

        loss = criterion(output, trg)  # Calculate loss
        loss.backward()  # Backpropagate
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # Clip gradients to avoid exploding gradient

        optimizer.step()  # Update parameters

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

### Evaluation Function

The `evaluate` function assesses the model's performance on validation or test data, without updating the model's weights.

1. **Evaluation Mode**:
    - `model.eval()` disables training-specific behaviors like dropout.
  
2. **No Gradient Calculation**:
    - `torch.no_grad()` ensures no gradients are computed, saving memory and speeding up the evaluation.

3. **Batch Processing**:
    - For each batch, the model is run with teacher forcing disabled (`teacher_forcing_ratio=0`).
    - The output and target are reshaped to match dimensions for loss calculation.

4. **Loss Calculation**:
    - Loss is computed between model predictions and target sequences.

5. **Return**:
    - The average loss over all batches is returned.



In [9]:
# Evaluation function
def evaluate(model, iterator, criterion, device):
    model.eval()  # Set model to evaluation mode
    epoch_loss = 0

    with torch.no_grad():  # Disable gradient calculation
        for batch in tqdm(iterator, desc="Evaluating"):
            src, trg = batch
            src, trg = src.to(device), trg.to(device)

            output = model(src, trg, 0)  # Turn off teacher forcing during evaluation

            output_dim = output.shape[-1]
            output = output[:, 1:].reshape(-1, output_dim)  # Reshape output for loss calculation
            trg = trg[:, 1:].reshape(-1)  # Flatten target sequence

            loss = criterion(output, trg)  # Calculate loss

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

### Beam Search for Text Generation

The `beam_search` function implements the beam search algorithm for sequence generation. This approach is used to generate the best possible output sequence by exploring multiple hypotheses at each decoding step.

1. **Embedding and Encoding**:
    - The input sequence is first embedded using the model's embedding layer.
    - The embedded sequence is passed through the encoder (a bi-directional LSTM), and the final hidden and cell states are obtained.

2. **Beam Initialization**:
    - The beam is initialized with the start-of-sequence token (`<sos>`), a score of 0, and the hidden and cell states from the encoder.

3. **Beam Search Process**:
    - For each time step, each sequence in the current beam is extended by predicting the next token using the decoder.
    - The top `beam_width` predictions (tokens with the highest probabilities) are selected.
    - These new sequences are added to the beam, and the beam is updated with the top `beam_width` sequences based on their cumulative scores.

4. **Handling Sequence Completion**:
    - If a sequence reaches the end-of-sequence token (`<eos>`) and is of sufficient length, it is added to the list of complete hypotheses.

5. **Final Sequence Selection**:
    - Once the beam search is complete or the maximum length is reached, the best sequence is selected from the completed hypotheses.
    - If no sequence ends with `<eos>`, the best incomplete sequence is chosen.

6. **Output**:
    - The selected sequence of token indices is converted back to words using the `inv_vocab` (index-to-word) mapping.



In [10]:
def beam_search(model, src, vocab, inv_vocab, beam_width=3, max_length=50, min_length=10, device='gpu'):
    model.eval()
    with torch.no_grad():
        # Embedding the input sequence
        embedded = model.embedding(src)  # shape: (batch_size, seq_len, embedding_dim)
        enc_output, (hidden, cell) = model.encoder(embedded)  # LSTM encoder output

        # In case of bi-directional LSTM, combine the hidden states
        if model.encoder.bidirectional:
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)  # shape: (batch_size, hidden_dim)
            cell = torch.cat((cell[-2, :, :], cell[-1, :, :]), dim=1)        # shape: (batch_size, hidden_dim)
        else:
            hidden = hidden[-1, :, :]  # Take the last layer if not bi-directional
            cell = cell[-1, :, :]      # Take the last layer if not bi-directional

        # Now we process one sequence at a time, so set batch size to 1
        hidden = hidden.unsqueeze(0)  # shape: (1, batch_size, hidden_dim)
        cell = cell.unsqueeze(0)      # shape: (1, batch_size, hidden_dim)

        # Initialize the beam with the start-of-sequence token
        beam = [([vocab['<sos>']], 0, hidden[:, 0:1, :], cell[:, 0:1, :])]  # Start with one sequence
        complete_hypotheses = []

        # Perform beam search
        for t in range(max_length):
            new_beam = []
            for seq, score, hidden, cell in beam:
                # If end-of-sequence token is reached and length is >= min_length, add to complete hypotheses
                if seq[-1] == vocab['<eos>'] and len(seq) >= min_length:
                    complete_hypotheses.append((seq, score))
                    continue

                # Prepare the input for the decoder (last predicted token)
                input = torch.LongTensor([seq[-1]]).unsqueeze(0).to(device)  # shape: (1, 1)
                input_embedded = model.embedding(input)  # shape: (1, 1, embedding_dim)

                # Pass through the decoder with the current hidden and cell states
                output, (hidden, cell) = model.decoder(input_embedded, (hidden, cell))  # hidden, cell are (1, 1, hidden_dim)
                predictions = model.fc(output.squeeze(1))  # shape: (1, vocab_size)

                # Prevent EOS if sequence is shorter than minimum length
                if len(seq) < min_length:
                    predictions[0][vocab['<eos>']] = float('-inf')

                # Get top beam_width predictions
                top_preds = torch.topk(predictions, beam_width, dim=1)

                # For each top prediction, extend the sequence and update the beam
                for i in range(beam_width):
                    new_seq = seq + [top_preds.indices[0][i].item()]
                    new_score = score - top_preds.values[0][i].item()  # Negative log probability
                    new_hidden = hidden.clone()
                    new_cell = cell.clone()
                    new_beam.append((new_seq, new_score, new_hidden, new_cell))

            # Sort by score and keep top beam_width sequences
            beam = sorted(new_beam, key=lambda x: x[1])[:beam_width]

            if len(complete_hypotheses) >= beam_width:
                break

        # Sort and return the best sequence
        complete_hypotheses = sorted(complete_hypotheses, key=lambda x: x[1])
        if complete_hypotheses:
            best_seq = complete_hypotheses[0][0]
        else:
            best_seq = beam[0][0]

    # Convert sequence of indices back to words
    return [inv_vocab[idx] for idx in best_seq if idx not in [vocab['<sos>'], vocab['<eos>'], vocab['<pad>']]]

In [11]:
# Save model function
def save_model(model, vocab, filepath):
    torch.save({
        'model_state_dict': model.state_dict(),
        'vocab': vocab
    }, filepath)
    print(f"Model saved to {'Copy_of_Hindi_Summarization_Beam_Search copy.ipynb'}")

In [12]:
# Define optimizer and loss function
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=vocab['<pad>'])

In [13]:
# Training loop
num_epochs = 10
best_val_loss = float('inf')
for epoch in range(num_epochs):
    train_loss = train(model, train_loader, optimizer, criterion, device)
    val_loss = evaluate(model, val_loader, criterion, device)
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {val_loss:.3f}')

    # Save model if validation loss improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_model(model, vocab, 'best_model.pth')

Training: 100%|██████████| 2087/2087 [48:28<00:00,  1.39s/it] 
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.10it/s]


Epoch: 01
	Train Loss: 5.205
	 Val. Loss: 6.031
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:46<00:00,  1.06s/it]
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.10it/s]


Epoch: 02
	Train Loss: 3.184
	 Val. Loss: 4.814
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:41<00:00,  1.06s/it]
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.11it/s]


Epoch: 03
	Train Loss: 2.267
	 Val. Loss: 4.081
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:30<00:00,  1.05s/it]
Evaluating: 100%|██████████| 232/232 [00:55<00:00,  4.15it/s]


Epoch: 04
	Train Loss: 1.783
	 Val. Loss: 3.636
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:35<00:00,  1.05s/it]
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.12it/s]


Epoch: 05
	Train Loss: 1.458
	 Val. Loss: 3.278
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:46<00:00,  1.06s/it]
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.11it/s]


Epoch: 06
	Train Loss: 1.235
	 Val. Loss: 3.069
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:43<00:00,  1.06s/it]
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.10it/s]


Epoch: 07
	Train Loss: 1.076
	 Val. Loss: 2.922
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:45<00:00,  1.06s/it]
Evaluating: 100%|██████████| 232/232 [00:55<00:00,  4.15it/s]


Epoch: 08
	Train Loss: 0.957
	 Val. Loss: 2.742
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


Training: 100%|██████████| 2087/2087 [36:36<00:00,  1.05s/it]
Evaluating: 100%|██████████| 232/232 [00:55<00:00,  4.15it/s]


Epoch: 09
	Train Loss: 0.859
	 Val. Loss: 2.773


Training: 100%|██████████| 2087/2087 [36:37<00:00,  1.05s/it]
Evaluating: 100%|██████████| 232/232 [00:56<00:00,  4.13it/s]


Epoch: 10
	Train Loss: 0.780
	 Val. Loss: 2.556
Model saved to Copy_of_Hindi_Summarization_Beam_Search copy.ipynb


In [14]:
# Load model function
def load_model(filepath, device):
    checkpoint = torch.load(filepath, map_location=device)
    vocab = checkpoint['vocab']
    model = BiLSTMSummarizer(vocab_size, embedding_dim, hidden_dim, output_dim).to(device)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model, checkpoint

In [15]:
# Load the best model for testing
best_model, _ = load_model('best_model.pth', device)

# Test the model
test_loss = evaluate(best_model, test_loader, criterion, device)
print(f'Test Loss: {test_loss:.3f}')

# Evaluate using ROUGE score
rouge = Rouge()
best_model.eval()
predictions = []
references = []
with torch.no_grad():
    for batch in tqdm(test_loader, desc="Generating summaries"):
        src, trg = batch
        src = src.to(device)
        pred = beam_search(best_model, src, vocab, inv_vocab, min_length=10, device=device)  # Set minimum length
        predictions.extend([' '.join(pred)])
        references.extend([' '.join([inv_vocab[idx.item()] for idx in trg[0] if idx.item() not in [vocab['<sos>'], vocab['<eos>'], vocab['<pad>']]])])

# Ensure all predictions meet the minimum length
min_length = 10  # Set this to your desired minimum length
predictions = [p if len(p.split()) >= min_length else p + ' ' + ' '.join(['<pad>'] * (min_length - len(p.split()))) for p in predictions]

scores = rouge.get_scores(predictions, references, avg=True)
print("ROUGE scores:")
print(scores)

  checkpoint = torch.load(filepath, map_location=device)
Evaluating: 100%|██████████| 580/580 [02:20<00:00,  4.12it/s]


Test Loss: 2.539


Generating summaries: 100%|██████████| 580/580 [02:25<00:00,  3.99it/s]


ROUGE scores:
{'rouge-1': {'r': 0.8053488385281857, 'p': 0.8311310077492784, 'f': 0.8165485464083134}, 'rouge-2': {'r': 0.7288117552363507, 'p': 0.7373940617458541, 'f': 0.7327340576196352}, 'rouge-l': {'r': 0.7854555646572221, 'p': 0.8081892196140378, 'f': 0.795352900614367}}


In [16]:
print("Loading pre-trained model...")
trained_model, checkpoint = load_model('best_model.pth', device)
vocab = checkpoint['vocab']
inv_vocab = {v: k for k, v in vocab.items()}
trained_model = trained_model.to(device)

Loading pre-trained model...


  checkpoint = torch.load(filepath, map_location=device)


In [17]:
# Modified Summarization bot
def summarize_text(model, vocab, inv_vocab, text, max_length=100, min_length=10, beam_width=3, device='cpu', debug=False):
    model.eval()
    tokens = tokenize(text)[:max_length]
    indices = [vocab['<sos>']] + [vocab.get(token, vocab['<unk>']) for token in tokens] + [vocab['<eos>']]
    src = torch.LongTensor(indices).unsqueeze(0).to(device)

    summary = beam_search(model, src, vocab, inv_vocab, beam_width, max_length, min_length, device)

    if debug:
        print("Input tokens:", tokens)
        print("Input indices:", indices)
        print("Generated indices:", [vocab[word] for word in summary])
        print("Summary length:", len(summary))

    return ' '.join(summary)

In [18]:
# Example usage of the summarization bot
input_text = "ऑस्ट्रेलिया ने ब्लूमफोनटीन में पहले वनडे में दक्षिण अफ्रीका को 3-विकेट से हरा दिया। यह 12 वर्षों में दक्षिण अफ्रीका के खिलाफ उसकी धरती पर ऑस्ट्रेलिया की पहली वनडे जीत है। ऑस्ट्रेलिया का स्कोर 16.3 ओवर में 113/7 था लेकिन मार्नस लबुशेन और ऐश्टन एगर की 112* रनों की साझेदारी की बदौलत उसने 40.2 ओवर में लक्ष्य हासिल कर लिया।"
summary = summarize_text(trained_model, vocab, inv_vocab, input_text, min_length=10, device=device, debug=True)
print("Generated Summary:")
print(summary)
print("Summary length:", len(summary.split()))

Input tokens: ['ऑस्ट्रेलिया', 'ने', 'ब्लूमफोनटीन', 'में', 'पहले', 'वनडे', 'में', 'दक्षिण', 'अफ्रीका', 'को', '3-विकेट', 'से', 'हरा', 'दिया।', 'यह', '12', 'वर्षों', 'में', 'दक्षिण', 'अफ्रीका', 'के', 'खिलाफ', 'उसकी', 'धरती', 'पर', 'ऑस्ट्रेलिया', 'की', 'पहली', 'वनडे', 'जीत', 'है।', 'ऑस्ट्रेलिया', 'का', 'स्कोर', '16.3', 'ओवर', 'में', '113/7', 'था', 'लेकिन', 'मार्नस', 'लबुशेन', 'और', 'ऐश्टन', 'एगर', 'की', '112', '*', 'रनों', 'की', 'साझेदारी', 'की', 'बदौलत', 'उसने', '40.2', 'ओवर', 'में', 'लक्ष्य', 'हासिल', 'कर', 'लिया।']
Input indices: [2, 3351, 83, 29389, 10, 1276, 3352, 10, 3184, 965, 76, 29390, 37, 3192, 27649, 229, 605, 489, 10, 3184, 965, 12, 323, 431, 3771, 98, 3351, 8, 575, 3352, 2706, 27646, 3351, 24, 3490, 29391, 3396, 10, 29392, 28, 2458, 3769, 3770, 73, 29393, 29394, 8, 10147, 3628, 8210, 8, 11848, 8, 3884, 4727, 29395, 3396, 10, 1983, 3806, 103, 27891, 3]
Generated indices: [86, 12, 1648, 56, 3352, 490, 3358, 1958, 12, 1276, 1243, 10, 3351, 12, 323, 405, 679, 12, 2486, 5367, 3405,

**ROUGE Score Evaluation**

The following ROUGE scores evaluate the performance of the text summarization model in terms of ROUGE-1, ROUGE-2, and ROUGE-L.

**ROUGE-1 (Unigrams)**

Recall: 0.805

Precision: 0.831

F1-score: 0.817

Analysis:

ROUGE-1 measures the overlap of unigrams (individual words) between the generated summaries and reference summaries. The recall score of 0.805 means that the model captures 80.5% of relevant unigrams from the reference summaries. The precision score of 0.831 shows that 83.1% of the unigrams generated by the model are correct. The F1-score, which balances precision and recall, is 0.817, indicating strong overall performance in capturing individual words.

**ROUGE-2 (Bigrams)**

Recall: 0.729

Precision: 0.737

F1-score: 0.733

Analysis:
ROUGE-2 focuses on the overlap of bigrams (pairs of consecutive words). The recall of 0.729 indicates that 72.9% of the relevant bigrams from the reference summaries are captured by the model. The precision score of 0.737 means that 73.7% of the bigrams in the generated summaries are correct. The F1-score of 0.733 reflects the model's reasonable performance in capturing longer word sequences, though it is lower than ROUGE-1, as expected, due to the increased complexity of matching bigrams.

**ROUGE-L (Longest Common Subsequence)**

Recall: 0.785

Precision: 0.808

F1-score: 0.795

Analysis:
ROUGE-L evaluates the longest common subsequence between the generated and reference summaries, focusing on capturing the overall structure of the text. The recall score of 0.785 shows that the model aligns well with the reference summaries, capturing 78.5% of the longest subsequences. Precision is higher at 0.808, meaning that 80.8% of the generated subsequences are correct. The F1-score of 0.795 demonstrates strong performance in maintaining the structural integrity of the summaries, comparable to ROUGE-1.

**Summary of Comparison:**

ROUGE-1 has the highest scores, reflecting the model's strength in capturing individual words accurately.

ROUGE-2 shows slightly lower scores, indicating that the model finds it more challenging to capture exact bigram (two-word sequence) matches.

ROUGE-L closely follows ROUGE-1 in performance, highlighting the model’s ability to capture the overall sequence structure and flow of the summaries.

Overall, the model performs well in generating summaries with strong word overlap (ROUGE-1) and sequence structure (ROUGE-L), though it shows some difficulty in matching consecutive word pairs (ROUGE-2).