# Implementing ROUGE-L Score for LLM Summarization Evaluation

### Background

ROUGE-L is a widely used metric for evaluating summarization quality. It is based on the longest common subsequence (LCS) between a generated summary and a reference summary. This notebook implements the metric and uses it to evaluate LLM-generated summaries.
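
For reference, the quantities computed later in this notebook can be written as follows. The notation is an assumption introduced here: $X$ is the reference with $m$ tokens, $Y$ is the generated summary with $n$ tokens, and $\beta$ weights recall against precision ($\beta = 1$ gives the usual F1).

$$
R_{lcs} = \frac{\mathrm{LCS}(X, Y)}{m}, \qquad
P_{lcs} = \frac{\mathrm{LCS}(X, Y)}{n}, \qquad
F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}
$$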

### Objectives

*   Understand and implement the ROUGE-L scoring metric
*   Work with real-world summarization data
*   Gain practical experience with LLM APIs
*   Apply text preprocessing techniques
*   Evaluate machine-generated summaries

### Implementation Overview

#### Data Preparation

- Load the CNN/DailyMail dataset using the Hugging Face datasets library

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

# Loading 10 samples for demonstration
try:
    dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:10]")
    print(f"Loaded dataset split with {len(dataset)} samples")
except Exception as e:
    print(f"Error loading CNN/DailyMail dataset: {e}")
    dataset = None


Loaded dataset split with 10 samples


- Implement text preprocessing functions
  - Basic text cleaning and special character handling
  - Handle contractions and whitespace

- Text tokenization and normalization
  - NLTK tokenization with fallback
  - Case normalization and word stemming using PorterStemmer

- Error handling and robustness
  - Proper error handling for all preprocessing steps
  - Appropriate fallback mechanisms

In [None]:
# Quote the requirement so the shell does not treat ">" as a redirect
!pip install "nltk>=3.6.3"

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize

def setup_nltk():
    """Download required NLTK resources"""
    try:
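        # Newer NLTK releases look up 'punkt_tab' for word_tokenize;
        # 'punkt' is kept for older releases.
        nltk.download('punkt')
        nltk.download('punkt_tab')
        nltk.download('averaged_perceptron_tagger')
        nltk.download('wordnet')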

        print("NLTK resources downloaded successfully!")
    except Exception as e:
        print(f"Error downloading NLTK resources: {e}")
        raise

setup_nltk()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


NLTK resources downloaded successfully!


In [None]:
!pip install num2words



In [None]:
from num2words import num2words
from nltk.stem import PorterStemmer

class TextPreprocessor:
    def __init__(self):
        self.stemmer = PorterStemmer()
        try:
            word_tokenize("Test sentence.")
        except LookupError as e:
            print("NLTK resources not found. Running setup again...")
            setup_nltk()

        self.contractions = {
            "can't": "cannot",
            "won't": "will not",
            "don't": "do not",
            "doesn't": "does not",
            "didn't": "did not",
            "isn't": "is not",
            "aren't": "are not",
            "wasn't": "was not",
            "weren't": "were not",
            "it's": "it is",
            "i'm": "i am",
            "they're": "they are",
            "we're": "we are",
            "you're": "you are",
            "i've": "i have",
            "we've": "we have",
            "they've": "they have",
            "shouldn't": "should not",
            "couldn't": "could not",
            "wouldn't": "would not"
        }

    def expand_contractions(self, text):
        # Normalize curly apostrophes first so "don’t" matches the "don't" key
        text = text.replace("’", "'")
        for contraction, expansion in self.contractions.items():
            text = text.replace(contraction, expansion)
            text = text.replace(contraction.capitalize(), expansion.capitalize())
        return text

    def remove_special_characters(self, text):
        """
        Strip URLs, emails, and stray symbols; spell out integers;
        keep letters, digits, whitespace, and basic punctuation.
        """
        # Drop the parentheses themselves but keep the text inside them
        text = re.sub(r'[()]', ' ', text)

        # Remove URLs and email addresses
        text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
        text = re.sub(r'\S+@\S+', ' ', text)

        # Spell out standalone integers (e.g. "30" -> "thirty")
        def _num_to_words(match):
            try:
                return num2words(int(match.group(0)))
            except Exception:
                # Leave the digits untouched if conversion fails
                return match.group(0)

        text = re.sub(r'\b\d+\b', _num_to_words, text)

        # Normalize curly quotes, then replace everything except letters,
        # digits, whitespace, and . , ! ? ; : ' with a space
        text = text.replace("“", '"').replace("”", '"').replace("’", "'")
        text = re.sub(r"[^A-Za-z0-9\s\.\,\!\?\;\:']", " ", text)
        # Collapse runs of whitespace
        return ' '.join(text.split())

    def tokenize_text(self, text):
        """
        Updated tokenization to better match rouge-score
        """
        try:
            tokens = word_tokenize(text)
            return [token for token in tokens if token not in {'``', "''"}]
        except LookupError:
            print("Warning: Using basic tokenization as fallback")
            return text.split()

    def normalize_case(self, tokens):
        """
        Add stemming to handle word variations
        """
        tokens = [token.lower() for token in tokens]
        return [self.stemmer.stem(token) for token in tokens]

    def preprocess(self, text):
        # Extract acronyms before processing
        acronyms = re.findall(r'\b(?:[A-Za-z]\.){2,}', text)

        # Use functions defined above
        try:
            text = self.expand_contractions(text)
            text = self.remove_special_characters(text)
            tokens = self.tokenize_text(text)
        except Exception as e:
            print(f"Warning during preprocessing, using simple fallback: {e}")
            tokens = text.split()

        # Appropriate fallback mechanisms
        if not tokens:
            tokens = text.split()

        tokens = self.normalize_case(tokens)

        # Re-introduce acronyms in normalized form
        for a in acronyms:
            a_clean = a.replace('.', '').lower()
            stemmed = self.stemmer.stem(a_clean)
            if stemmed not in tokens:
                tokens.append(stemmed)

        return tokens

In [None]:
# Initialize preprocessor
preprocessor = TextPreprocessor()

# Test with sample text
sample_text = "Hello! This is a sample text w/ special chars... Check it out @ http://example.com"

try:
    processed_tokens = preprocessor.preprocess(sample_text)
    print(f"Original text: {sample_text}")
    print(f"Processed tokens: {processed_tokens}")
except Exception as e:
    print(f"Error processing text: {e}")

NLTK resources not found. Running setup again...
NLTK resources downloaded successfully!
Original text: Hello! This is a sample text w/ special chars... Check it out @ http://example.com
Processed tokens: ['hello!', 'thi', 'is', 'a', 'sampl', 'text', 'w', 'special', 'chars...', 'check', 'it', 'out']




#### Generate Summaries using Gemini API

- Set up Gemini API authentication
- Implement API calling function with rate limiting
- Handle API responses and errors
- Response processing

In [None]:
!pip install -q -U google-genai


In [None]:
from google.colab import userdata
import time
from google import genai

# Gemini API setup

GEMINI_API_KEY = userdata.get("GEMINI_API_KEY")
if GEMINI_API_KEY is None:
    raise ValueError("Please set GEMINI_API_KEY in Colab userdata as 'GEMINI_API_KEY'.")

# Create a Gemini client using the API key
client = genai.Client(api_key=GEMINI_API_KEY)


def get_summary(text, model="gemini-2.5-flash", max_retries=3):
    """
    Summarize `text` with the Gemini API.

    Retries with exponential backoff on rate-limit (429) errors and
    returns an empty string if no summary could be obtained.
    """
    if not text:
        return ""

    delay_seconds = 1.0

    for attempt in range(max_retries):
        try:
            prompt = (
                "You are a helpful assistant that writes concise summaries.\n"
                "Summarize the following text in 3–4 sentences, preserving the key facts:\n\n"
                f"{text}"
            )

            response = client.models.generate_content(
                model=model,
                contents=prompt,
            )

            summary = getattr(response, "text", "")
            if summary is None:
                summary = ""

            return summary.strip()

        except Exception as e:
            error_message = str(e)
            print(f"Gemini API error on attempt {attempt + 1}: {error_message}")

            # Simple rate limiting / retry logic on 429 or rate-related errors
            if (
                ("429" in error_message) or ("rate" in error_message.lower())
            ) and attempt < max_retries - 1:
                time.sleep(delay_seconds)
                delay_seconds *= 2
                continue

            break

    return ""
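
The retry behaviour in `get_summary` can be exercised without network access. The following is a minimal sketch of the same exponential-backoff pattern, assuming a hypothetical `call_with_backoff` helper and an injectable `sleep` function (neither is part of the notebook or of the `google-genai` SDK). The rate-limit check mirrors the string match used above.

```python
import time
from typing import Callable, List


def call_with_backoff(
    fn: Callable[[], str],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn(), retrying with doubling delays on rate-limit-like errors."""
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            message = str(e)
            is_rate_limit = "429" in message or "rate" in message.lower()
            if is_rate_limit and attempt < max_retries - 1:
                sleep(delay)   # wait before the next attempt
                delay *= 2     # double the wait each time
                continue
            break              # non-retryable error or retries exhausted
    return ""


# Simulated API (hypothetical): fails twice with a 429, then succeeds
attempts = {"count": 0}
recorded_delays: List[float] = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "summary text"


# Record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing `sleep` as a parameter keeps the waiting logic testable: the example records the delays that would have been used rather than pausing the notebook.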


#### ROUGE-L and ROUGE-LSum Implementation

##### Basic ROUGE-L Implementation

###### LCS Table Implementation

In [None]:
import numpy as np
from typing import List, Dict

def get_lcs_table(ref_tokens: List[str], pred_tokens: List[str]) -> np.ndarray:
    """
    Compute the Longest Common Subsequence table
    """
    m = len(ref_tokens)
    n = len(pred_tokens)

    # LCS DP table of size (m+1) x (n+1)
    lcs_table = np.zeros((m + 1, n + 1), dtype=int)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == pred_tokens[j - 1]:
                lcs_table[i, j] = lcs_table[i - 1, j - 1] + 1
            else:
                lcs_table[i, j] = max(lcs_table[i - 1, j], lcs_table[i, j - 1])

    return lcs_table
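
To see what the table encodes, the sketch below rebuilds it with plain Python lists (no NumPy, so it runs on its own) and walks back from the bottom-right cell to recover one longest common subsequence. The helper name `lcs_tokens` and the toy sentences are illustrative assumptions, not part of the notebook.

```python
from typing import List


def lcs_tokens(ref_tokens: List[str], pred_tokens: List[str]) -> List[str]:
    """Return one longest common subsequence of the two token lists."""
    m, n = len(ref_tokens), len(pred_tokens)
    # table[i][j] = LCS length of ref_tokens[:i] and pred_tokens[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == pred_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Backtrack from the bottom-right cell to collect the matched tokens
    lcs: List[str] = []
    i, j = m, n
    while i > 0 and j > 0:
        if ref_tokens[i - 1] == pred_tokens[j - 1]:
            lcs.append(ref_tokens[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    lcs.reverse()
    return lcs


ref = "the cat sat on the mat".split()
pred = "the cat lay on a mat".split()
common = lcs_tokens(ref, pred)
print(common, len(common))
```

The matched tokens need not be adjacent in either sequence, only in the same order, which is what distinguishes ROUGE-L from n-gram overlap metrics.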

###### ROUGE-L Score Calculation

In [None]:
def compute_rouge_l(reference: List[str], prediction: List[str], beta: float = 1.2) -> Dict[str, float]:
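    """
    Basic ROUGE-L computation over two token lists.

    The 'f1' entry is an F-beta score: with the default beta = 1.2 it
    weights recall slightly above precision; pass beta = 1.0 for the
    standard F1.
    """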
    if not reference or not prediction:
        return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}

    lcs_table = get_lcs_table(reference, prediction)
    lcs_length = lcs_table[len(reference), len(prediction)]

    precision = lcs_length / len(prediction)
    recall = lcs_length / len(reference)

    if precision == 0.0 or recall == 0.0:
        f1 = 0.0
    else:
        beta2 = beta ** 2
        f1 = (1 + beta2) * precision * recall / (recall + beta2 * precision)

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
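
The choice of `beta` matters when comparing against other tools. As a small worked example with made-up values (LCS length 4, a 10-token prediction, a 5-token reference), the sketch below evaluates the same F-measure formula for `beta = 1.0` (the standard F1) and for the `beta = 1.2` default used above. `f_beta` is a hypothetical helper name.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure with recall weighted beta times as much as precision."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    beta2 = beta ** 2
    return (1 + beta2) * precision * recall / (recall + beta2 * precision)


# Made-up counts for illustration only
lcs_length, pred_len, ref_len = 4, 10, 5
precision = lcs_length / pred_len   # 0.4
recall = lcs_length / ref_len       # 0.8

f_standard = f_beta(precision, recall, beta=1.0)
f_weighted = f_beta(precision, recall, beta=1.2)
print(f"beta=1.0: {f_standard:.3f}  beta=1.2: {f_weighted:.3f}")
```

When recall exceeds precision, a `beta` above 1 raises the score relative to F1. If the tool being compared against reports a plain F1 (as `rouge-score` appears to), a gap in the `f1` field can therefore appear even when precision and recall agree exactly.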

##### ROUGE-LSum Implementation

###### Split Tokens into Sentences

In [None]:
def split_into_sentences(tokens: List[str]) -> List[List[str]]:
    """
    Split tokens into sentences
    """
    sentences = []
    current_sentence = []

    for token in tokens:
        current_sentence.append(token)
        if token in {'.', '!', '?'}:
            if current_sentence:
                sentences.append(current_sentence)
                current_sentence = []

    if current_sentence:
        sentences.append(current_sentence)

    return sentences
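
`split_into_sentences` only breaks on tokens that are exactly `.`, `!`, or `?`, so it depends on the tokenizer emitting punctuation as separate tokens. The sketch below repeats the function so it runs standalone and contrasts the two cases, using hand-written token lists as stand-ins for tokenizer output.

```python
from typing import List


def split_into_sentences(tokens: List[str]) -> List[List[str]]:
    """Same logic as above: close a sentence at '.', '!' or '?' tokens."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        if token in {'.', '!', '?'}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Punctuation as its own token (what a word-level tokenizer produces)
separated = ['nfl', 'suspend', 'vick', '.', 'vick', 'due', 'in', 'court', '.']
# Punctuation glued to words (what a plain text.split() produces)
attached = ['nfl', 'suspend', 'vick.', 'vick', 'due', 'in', 'court.']

n_separated = len(split_into_sentences(separated))
n_attached = len(split_into_sentences(attached))
print(n_separated, n_attached)
```

In the second case the whole summary is treated as one sentence, which is what happens whenever preprocessing falls back to `text.split()`.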

###### ROUGE-LSum Score Calculation

In [None]:
def compute_rouge_lsum(reference: List[str], prediction: List[str], beta: float = 1.2) -> Dict[str, float]:
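    """
    Compute a sentence-level ROUGE-LSum score.

    For each reference sentence, take the longest LCS against any single
    prediction sentence, then sum these lengths over all reference sentences.
    """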
    if not reference or not prediction:
        return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}

    try:
        # Split into sentences
        ref_sentences = split_into_sentences(reference)
        pred_sentences = split_into_sentences(prediction)

        total_lcs_length = 0
        for ref_sent in ref_sentences:
            max_lcs_length = 0
            for pred_sent in pred_sentences:
                lcs_table = get_lcs_table(ref_sent, pred_sent)
                lcs_len = lcs_table[len(ref_sent), len(pred_sent)]
                max_lcs_length = max(max_lcs_length, lcs_len)
            total_lcs_length += max_lcs_length

        total_ref_length = sum(len(s) for s in ref_sentences)
        total_pred_length = sum(len(s) for s in pred_sentences)

        precision = total_lcs_length / total_pred_length if total_pred_length > 0 else 0.0
        recall = total_lcs_length / total_ref_length if total_ref_length > 0 else 0.0

        if precision == 0.0 or recall == 0.0:
            f1 = 0.0
        else:
            beta2 = beta ** 2
            f1 = (1 + beta2) * precision * recall / (recall + beta2 * precision)

        return {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    except Exception as e:
        print(f"Error in ROUGE-LSum computation: {e}")
        return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
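
The function above keeps only the single best-matching prediction sentence for each reference sentence. Summary-level ROUGE-L is more commonly described in terms of a union LCS, where matches from all prediction sentences are merged per reference sentence. The following is a minimal sketch of that idea under stated assumptions: `lcs_ref_indices` and `union_lcs_count` are hypothetical helpers, the tokens are placeholders, and the token-count clipping a full implementation would need is omitted.

```python
from typing import List, Set


def lcs_ref_indices(ref: List[str], cand: List[str]) -> Set[int]:
    """Indices of ref tokens that take part in one LCS with cand."""
    m, n = len(ref), len(cand)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Backtrack and record which reference positions were matched
    indices: Set[int] = set()
    i, j = m, n
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            indices.add(i - 1)
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return indices


def union_lcs_count(ref_sent: List[str], cand_sents: List[List[str]]) -> int:
    """Number of ref tokens covered by the LCS with at least one candidate."""
    covered: Set[int] = set()
    for cand in cand_sents:
        covered |= lcs_ref_indices(ref_sent, cand)
    return len(covered)


ref_sent = ["w1", "w2", "w3", "w4", "w5"]
cand_sents = [
    ["w1", "w2", "w6", "w7", "w8"],   # shares w1 w2 with the reference
    ["w1", "w3", "w8", "w9", "w5"],   # shares w1 w3 w5 with the reference
]
union_hits = union_lcs_count(ref_sent, cand_sents)
best_single = max(len(lcs_ref_indices(ref_sent, c)) for c in cand_sents)
print(union_hits, best_single)
```

Here the union covers four reference tokens while the best single sentence covers three, so the two aggregation rules can give different recall for the same inputs.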

##### Testing Implementation

Test ROUGE implementation using CNN/DailyMail dataset and Gemini API summarization

- Dataset integration
  - Successfully load CNN/DailyMail dataset
  - Handle data extraction properly

- Preprocessing implementation
  - Implement text cleaning and tokenization
  - Handle preprocessing edge cases

- API integration
  - Implement Gemini API calls
  - Handle API errors appropriately

- Official library comparison
  - Install and integrate rouge-score library
  - Compare custom scores with official library scores
  - Analyze and document differences (maximum absolute difference < 0.05)

- Score calculation and results analysis
  - Calculate and display both custom and official ROUGE scores
  - Provide clear comparison of results
  - Understand any significant differences and potential improvements

In [None]:
# First install the rouge-score library
!pip install rouge-score



In [None]:
from rouge_score import rouge_scorer

def test_rouge_with_dataset(sample_idx: int):
    """
    Test ROUGE implementation using a single article from CNN/DailyMail dataset

    Args:
        sample_idx: Index of the article to test
    """
    # Initialize preprocessor and official scorer
    preprocessor = TextPreprocessor()
    official_scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    print(f"Testing ROUGE scores with article index {sample_idx} from CNN/DailyMail dataset...")

    try:
        # Get the article
        article = dataset[sample_idx]

        # Get original article and reference summary
        original_text = article.get('article', '')
        reference_summary = article.get('highlights', '')

        print(f"\nOriginal text length: {len(original_text)}")
        print(f"Reference summary length: {len(reference_summary)}")

        # Generate summary using Gemini API
        generated_summary = get_summary(original_text)
        if not generated_summary:
            print("Error: Could not generate summary")
            return None

        # Preprocess texts for custom implementation
        ref_tokens = preprocessor.preprocess(reference_summary)
        pred_tokens = preprocessor.preprocess(generated_summary)

        # Calculate custom ROUGE scores
        rouge_l_scores = compute_rouge_l(ref_tokens, pred_tokens)
        rouge_lsum_scores = compute_rouge_lsum(ref_tokens, pred_tokens)

        # Calculate official ROUGE scores
        official_scores = official_scorer.score(reference_summary, generated_summary)

        # Store results
        results = {
            'article_id': sample_idx,
            'original_length': len(original_text),
            'reference_length': len(reference_summary),
            'generated_length': len(generated_summary),
            'custom_rouge_l': rouge_l_scores,
            'custom_rouge_lsum': rouge_lsum_scores,
            'official_rouge_l': {
                'precision': official_scores['rougeL'].precision,
                'recall': official_scores['rougeL'].recall,
                'f1': official_scores['rougeL'].fmeasure,
            }
        }

        # Calculate differences
        diff_precision = abs(rouge_l_scores['precision'] - official_scores['rougeL'].precision)
        diff_recall = abs(rouge_l_scores['recall'] - official_scores['rougeL'].recall)
        diff_f1 = abs(rouge_l_scores['f1'] - official_scores['rougeL'].fmeasure)
        max_diff = max(diff_precision, diff_recall, diff_f1)

        # Print detailed results
        print(f"\nArticle Results:")
        print("-" * 50)
        print("\nReference Summary:")
        print(reference_summary)
        print("\nGenerated Summary:")
        print(generated_summary)

        print("\nCustom ROUGE-L Scores:")
        print(f"Precision: {rouge_l_scores['precision']:.3f}")
        print(f"Recall: {rouge_l_scores['recall']:.3f}")
        print(f"F1: {rouge_l_scores['f1']:.3f}")

        print("\nOfficial ROUGE-L Scores:")
        print(f"Precision: {official_scores['rougeL'].precision:.3f}")
        print(f"Recall: {official_scores['rougeL'].recall:.3f}")
        print(f"F1: {official_scores['rougeL'].fmeasure:.3f}")

        print("\nCustom ROUGE-LSum Scores:")
        print(f"Precision: {rouge_lsum_scores['precision']:.3f}")
        print(f"Recall: {rouge_lsum_scores['recall']:.3f}")
        print(f"F1: {rouge_lsum_scores['f1']:.3f}")

        print("\nImplementation Comparison:")
        print(f"Maximum difference between implementations: {max_diff:.3f}")
        if max_diff < 0.05:
            print("✓ Custom implementation closely matches the official library (within 5% threshold)")
        else:
            print("⚠ Custom implementation shows significant differences from the official library")

        return results

    except Exception as e:
        print(f"Error processing article {sample_idx}: {e}")
        if 'article' in locals():
            print(f"Article structure: {article.keys()}")  # Print keys to debug
        return None

In [None]:
import random

# Get dataset size
dataset_size = len(dataset)
print(f"Dataset size: {dataset_size}")

# Generate 2 random indices
indices = random.sample(range(dataset_size), 2)
print(f"Testing articles at indices: {indices}")

# Test each randomly selected article
for idx in indices:
    print(f"\nTesting article at index {idx}")
    result = test_rouge_with_dataset(idx)
    if result:
        print(f"Successfully processed article {idx}")
    else:
        print(f"Failed to process article {idx}")

Dataset size: 10
Testing articles at indices: [2, 4]

Testing article at index 2
NLTK resources not found. Running setup again...
NLTK resources downloaded successfully!
Testing ROUGE scores with article index 2 from CNN/DailyMail dataset...

Original text length: 3940
Reference summary length: 224





Article Results:
--------------------------------------------------

Reference Summary:
NEW: "I thought I was going to die," driver says .
Man says pickup truck was folded in half; he just has cut on face .
Driver: "I probably had a 30-, 35-foot free fall"
Minnesota bridge collapsed during rush hour Wednesday .

Generated Summary:
Drivers on the Minneapolis bridge described a sudden, "lightning-quick" collapse, experiencing terrifying free falls as the structure completely gave way, sending vehicles into the water and on fire. Despite their own harrowing escapes and injuries, survivors like Gary Babineau immediately helped pull screaming, bleeding children from a school bus and aided other victims. Emergency room physician Dr. John Hink and other first responders rushed to the chaotic scene, facing difficult access to victims on the river. Despite these challenges, volunteers and officials managed to rescue 55 people within two hours through organized efforts.

Custom ROUGE-L Scores:





Article Results:
--------------------------------------------------

Reference Summary:
NEW: NFL chief, Atlanta Falcons owner critical of Michael Vick's conduct .
NFL suspends Falcons quarterback indefinitely without pay .
Vick admits funding dogfighting operation but says he did not gamble .
Vick due in federal court Monday; future in NFL remains uncertain .

Generated Summary:
NFL quarterback Michael Vick has been indefinitely suspended without pay following his admission to participating in a dogfighting ring. As part of a plea agreement with federal prosecutors, Vick confessed to involvement in the illegal operation, including killing dogs that did not perform well. He is scheduled to appear in court to plead guilty to a federal conspiracy charge, which carries a potential prison sentence. NFL Commissioner Roger Goodell condemned Vick's "cruel and reprehensible" actions, and the Falcons may seek to recover a significant portion of his signing bonus.

Custom ROUGE-L Scores:
Precisi

* Across the two tested articles, the custom ROUGE-L scores follow the same pattern as the official rouge-score outputs: precision, recall, and F1 fall in the same general range, with recall above precision and F1 in between.

* For article 2, the maximum difference between custom and official ROUGE-L scores is about 0.019, well within the 0.05 absolute-difference threshold (the "5%" check in the code). On this example, the custom LCS and ROUGE-L implementation is closely aligned with the official library.

* For article 4, the differences are larger (maximum difference around 0.067, above the threshold), especially for recall and F1. Tokenization and normalization evidently diverge enough from the official behavior here to change the LCS length and the resulting scores. The number-to-words conversion, the punctuation tokens kept by the custom cleaner, and the `beta = 1.2` default in the F-measure are the most visible candidates.

* ROUGE-LSum differs visibly from ROUGE-L in these runs (for article 4, ROUGE-LSum recall and F1 are substantially higher). The sentence-level aggregation in the custom ROUGE-LSum implementation therefore behaves differently from a single global LCS and captures more overlap when sentences are aligned one by one.

* Overall, the core implementation matches the official scores on some samples but not consistently. Staying within the 0.05 threshold would require aligning the preprocessing details (tokenization rules, punctuation handling, and sentence-splitting heuristics) with rouge-score’s internal assumptions.
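
One way to probe the preprocessing gap is to compare token streams directly. The sketch below is a rough stand-in for a rouge-score-style tokenizer, assuming it lowercases the text and treats every run of non-alphanumeric characters as a separator (stemming is left out). `approx_rouge_tokens` is a hypothetical helper, and the library source should be checked before relying on this approximation.

```python
import re
from typing import List


def approx_rouge_tokens(text: str) -> List[str]:
    """Approximate tokenizer: lowercase, split on non-alphanumeric runs."""
    text = text.lower()
    # Assumption: anything that is not a-z or 0-9 acts as a separator
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.split()


# A line taken from the reference summary of article 2
line = 'Driver: "I probably had a 30-, 35-foot free fall"'
tokens = approx_rouge_tokens(line)
print(tokens)
```

Under this scheme the digits `30` and `35` stay as digits and no punctuation tokens survive, whereas the custom pipeline spells numbers out and keeps tokens such as `.` and `,`. Each of those differences changes the token counts in the precision and recall denominators.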

