<a href="https://colab.research.google.com/github/snipaid-nlg/models/blob/main/news-headline-generation/news-headline-generation-with-mt5-small.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Headline Generation with mt5-small
Evaluating [mt5-small](https://huggingface.co/google/mt5-small) on the task of generating headlines from journalistic text.

## Installs and Imports

In [None]:
!pip install -q transformers datasets accelerate sentencepiece


In [None]:
from datasets import load_dataset, Dataset
from transformers.pipelines.pt_utils import KeyDataset
from transformers import AutoModelForSeq2SeqLM, T5Tokenizer, pipeline
import pandas as pd
from tqdm.auto import tqdm

## Loading and transforming 10kGNAD
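We load the [Ten Thousand German News Articles Dataset](https://tblock.github.io/10kGNAD/) from the Hugging Face Hub. The first sentence of each text (everything up to the first period) is the headline, so we split the text at that period and use the remainder as the article body. The prefix `summarize:` is prepended, following the T5 convention for summarization. Note that, according to its model card, mT5 was pre-trained only on the unsupervised mC4 corpus without any supervised task, so the prefix alone is not expected to trigger summarization behavior.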

In [None]:
gnad10 = load_dataset("gnad10")
df = pd.DataFrame(gnad10["train"])
df[['headline', 'fulltext']] = df["text"].str.split('.', n=1, expand=True)
df.drop(['label','text'], axis=1, inplace=True)
df['fulltext'] = "summarize: " + df['fulltext']
df.head()




| | headline | fulltext |
|---|---|---|
| 0 | 21-Jähriger fällt wohl bis Saisonende aus | summarize: Wien – Rapid muss wohl bis Saisone... |
| 1 | Erfundene Bilder zu Filmen, die als verloren g... | summarize: Wien – Die Filmgeschichte ist ein ... |
| 2 | Der frischgekürte CEO Sundar Pichai setzt auf ... | summarize: Die Atmosphäre im Silicon Valley i... |
| 3 | Putin: "Einigung, dass wir Menge auf Niveau vo... | summarize: Moskau – Die russischen Ölproduzen... |
| 4 | Estland sieht den künftigen österreichischen P... | summarize: Wien/Tallinn/Pskow – Die Eltern de... |
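
The split at the first period can also be checked on plain strings, without pandas. The following is a minimal sketch, assuming a hypothetical helper `split_headline` (not part of the notebook) and made-up sample texts. It approximates `str.split('.', n=1)` and additionally trims surrounding whitespace. It also shows a limitation of the heuristic: a headline that itself contains a period is cut short.

```python
def split_headline(text):
    """Split a text at the first period into (headline, body).

    Hypothetical helper for illustration only; it approximates
    str.split('.', n=1) and trims surrounding whitespace.
    """
    headline, _, body = text.partition(".")
    return headline.strip(), body.strip()


# Made-up sample in the style of the dataset: headline, period, article body.
headline, body = split_headline("Rapid verliert Stürmer. Wien – Der Klub teilte mit.")
print(headline)  # Rapid verliert Stürmer
print(body)      # Wien – Der Klub teilte mit.

# Caveat: a period inside the headline ends the headline too early.
short_headline, rest = split_headline("Die 3. Runde beginnt. Wien – Details folgen.")
print(short_headline)  # Die 3
```

Headlines with ordinal numbers or abbreviations may therefore end up truncated, with the rest of the headline leaking into the article body.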


Next, we convert the pandas DataFrame into a Hugging Face `Dataset` object.

In [None]:
dataset = Dataset.from_pandas(df)

## Loading the model

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small", use_cache=True, device_map='auto')
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")


In [None]:
template = {
    "num_beams": 5,
    "length_penalty": 1.0,
    "min_new_tokens": 5,
    "no_repeat_ngram_size": 0,
    "repetition_penalty": 1.0,
    "diversity_penalty": 0.0,
    "num_beam_groups": 1,
    "do_sample": False,
    "top_k": 50,
    "top_p": 1.0,
    "temperature": 1.0,
    "early_stopping": False,
    "pad_token_id": 0,
    "eos_token_id": 1,
    "num_return_sequences": 1
}
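
Not every entry in the settings above influences decoding. With `do_sample` set to `False`, the sampling parameters (`top_k`, `top_p`, `temperature`) are not used, and `diversity_penalty` only matters for group beam search (`num_beam_groups` greater than 1). The sketch below is one way to make this visible. It assumes a hypothetical helper `effective_params` that is not part of the notebook or of transformers, and it works on a shortened copy of the settings.

```python
# Shortened copy of the notebook's generation settings.
template = {
    "num_beams": 5,
    "num_beam_groups": 1,
    "diversity_penalty": 0.0,
    "do_sample": False,
    "top_k": 50,
    "top_p": 1.0,
    "temperature": 1.0,
}

# Parameters that only take effect when sampling is enabled.
SAMPLING_ONLY = {"top_k", "top_p", "temperature"}


def effective_params(params):
    """Return only the settings that influence decoding (hypothetical helper)."""
    ignored = set()
    if not params.get("do_sample", False):
        ignored |= SAMPLING_ONLY          # only used when sampling
    if params.get("num_beam_groups", 1) <= 1:
        ignored.add("diversity_penalty")  # only used by group beam search
    return {k: v for k, v in params.items() if k not in ignored}


print(effective_params(template))
# {'num_beams': 5, 'num_beam_groups': 1, 'do_sample': False}
```

In other words, the configuration amounts to plain beam search with five beams.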

In [None]:
pipe = pipeline("text2text-generation",
                device=0,
                model=model, 
                tokenizer=tokenizer, 
                **template)

## Generating Headlines

In [None]:
for out in tqdm(pipe(KeyDataset(dataset, "fulltext"))):
    print(out)


[{'generated_text': '<extra_id_0>))))))))'}]
[{'generated_text': '<extra_id_0> finden sich auch in diesem Film.'}]
[{'generated_text': '<extra_id_0> Google-Chef Hiroshi Lockheimer'}]
[{'generated_text': '<extra_id_0> auf dem Jänner-Niveau einfrieren.'}]
[{'generated_text': '<extra_id_0> auch in der Region Pskow erscheinen.'}]
[{'generated_text': '<extra_id_0> - MMC in Chicago setzte mit 50 Mitarbeitern rund 100 Mitarbeiter um.'}]
[{'generated_text': '<extra_id_0> an der Europa League-Qualifikation'}]
[{'generated_text': '<extra_id_0> der ORF-Führung'}]
[{'generated_text': '<extra_id_0>: Islamabad -'}]
[{'generated_text': '<extra_id_0> spielen. © Getty Images'}]
[{'generated_text': '<extra_id_0> Bastian Schweinsteiger gelang in der Schlussminute aus einem Elfmeter.'}]
[{'generated_text': '<extra_id_0> Markus Friesacher'}]
[{'generated_text': '<extra_id_0> das Original FFVII.'}]
[{'generated_text': '<extra_id_0>: Leicester – Teamkapitän Christian Fuchs'}]
[{'generated_text': '<extra_id_0

KeyboardInterrupt: ignored
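
The run above was stopped by hand, and every generated text begins with `<extra_id_0>`, a sentinel token from T5-style span-corruption pre-training. The following is a minimal sketch of how one might limit the number of outputs up front and strip such tokens. The helper `clean_outputs` and the `limit` parameter are assumptions for illustration and not part of the notebook. A plain list stands in for the pipeline iterator.

```python
import re
from itertools import islice

# Matches T5/mT5 sentinel tokens such as <extra_id_0>.
SENTINEL = re.compile(r"<extra_id_\d+>")


def clean_outputs(outputs, limit):
    """Yield the first `limit` generated texts with sentinel tokens removed.

    Hypothetical helper; `outputs` is any iterable of pipeline results
    shaped like [{'generated_text': ...}].
    """
    for out in islice(outputs, limit):
        text = out[0]["generated_text"]
        yield SENTINEL.sub("", text).strip()


# Stand-in for the pipeline iterator, using three of the outputs shown above.
sample_outputs = [
    [{"generated_text": "<extra_id_0> der ORF-Führung"}],
    [{"generated_text": "<extra_id_0> Markus Friesacher"}],
    [{"generated_text": "<extra_id_0> das Original FFVII."}],
]

cleaned = list(clean_outputs(sample_outputs, limit=2))
print(cleaned)  # ['der ORF-Führung', 'Markus Friesacher']
```

In the notebook, the same helper could presumably wrap `pipe(KeyDataset(dataset, "fulltext"))` directly. Because `islice` consumes the iterator lazily, only the first `limit` articles would be requested from it.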