<a href="https://colab.research.google.com/github/snipaid-nlg/models/blob/main/news-headline-generation/news-headline-generation-with-bloom-6b4-clp-german.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Headline Generation with bloom-6b4-clp-german
Evaluating bloom-6b4-clp-german on the task of generating headlines from German journalistic text.

## Installs and Imports

In [None]:
!pip install -q transformers datasets accelerate bitsandbytes sentencepiece


In [None]:
from datasets import load_dataset, Dataset
from transformers.pipelines.pt_utils import KeyDataset
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import pandas as pd
from tqdm.auto import tqdm

## Loading and transforming 10kGNAD
We load the [Ten Thousand German News Articles Dataset](https://tblock.github.io/10kGNAD/) (10kGNAD) from Hugging Face. The first sentence of each text (everything up to the first period) is the headline, so we split each text at that period: the first part becomes the reference headline and the remainder becomes the article body. The suffix ` Headline:` is appended to the body to prompt the model.

In [None]:
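# Load 10kGNAD and move the training split into a pandas DataFrame.
gnad10 = load_dataset("gnad10")
df = pd.DataFrame(gnad10["train"])
# Split each text at the first period: headline before it, article body after it.
df[['headline', 'fulltext']] = df["text"].str.split('.', n=1, expand=True)
df.drop(['label','text'], axis=1, inplace=True)
# Append the prompt suffix that cues the model to write a headline.
df['fulltext'] = df['fulltext'] + " Headline:"
df.head()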


| | headline | fulltext |
|---|---|---|
| 0 | 21-Jähriger fällt wohl bis Saisonende aus | Wien – Rapid muss wohl bis Saisonende auf Off... |
| 1 | Erfundene Bilder zu Filmen, die als verloren g... | Wien – Die Filmgeschichte ist ein Friedhof de... |
| 2 | Der frischgekürte CEO Sundar Pichai setzt auf ... | Die Atmosphäre im Silicon Valley ist rau. Da ... |
| 3 | Putin: "Einigung, dass wir Menge auf Niveau vo... | Moskau – Die russischen Ölproduzenten wollen ... |
| 4 | Estland sieht den künftigen österreichischen P... | Wien/Tallinn/Pskow – Die Eltern des künftigen... |
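To see what the split at the first period does to a single text, here is a minimal sketch in plain Python. The sample texts and the helper name `split_headline` are made up for illustration and are not taken from 10kGNAD; the helper mirrors `str.split('.', n=1)` plus the prompt suffix for texts that contain at least one period.

```python
# Toy texts (hypothetical, not from 10kGNAD) in the "headline. body" layout.
samples = [
    "Kurze Schlagzeile. Wien – Der erste Satz. Der zweite Satz.",
    "Treffen mit Dr. Muster geplant. Berlin – Der Artikeltext folgt hier.",
]

def split_headline(text, suffix=" Headline:"):
    """Split at the first period and append the prompt suffix to the body.

    Assumes the text contains at least one period.
    """
    headline, body = text.split(".", 1)
    return headline, body + suffix

for text in samples:
    headline, prompt = split_headline(text)
    print(repr(headline), "->", repr(prompt))
```

The second sample shows a limitation of this heuristic: a headline that itself contains a period (here the abbreviation "Dr.") is cut at that period, and the remainder of the headline ends up in the article body.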


Convert the pandas DataFrame to a Hugging Face `Dataset` object.

In [None]:
dataset = Dataset.from_pandas(df)

## Loading the model

In [None]:
model = AutoModelForCausalLM.from_pretrained("malteos/bloom-6b4-clp-german", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("malteos/bloom-6b4-clp-german")


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
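
A rough back-of-the-envelope sketch of why 8-bit loading helps on a single Colab GPU. It assumes that "6b4" in the model name means about 6.4 billion parameters, and it counts weights only, ignoring activations and any modules that 8-bit loading keeps in higher precision. The function name `weight_memory_gb` is hypothetical.

```python
# Assumption: "6b4" in the model name stands for roughly 6.4 billion parameters.
PARAMS = 6.4e9

# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, nbytes):.1f} GB")
```

Under these assumptions the weights shrink from roughly 12.8 GB in half precision to roughly 6.4 GB in 8-bit, which leaves more headroom for generation on a single GPU.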



In [None]:
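# Generation settings. With do_sample=False and num_beams=1 this is greedy decoding,
# so the sampling settings (top_k, top_p, temperature) and the beam-search settings
# (length_penalty, early_stopping, diversity_penalty) have no effect here.
template = {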
      "min_new_tokens": 5,
      "max_new_tokens": 20,
      "length_penalty": 1.0,
      "no_repeat_ngram_size": 0,
      "num_beams": 1,
      "repetition_penalty": 1.0,
      "diversity_penalty": 0.0,
      "num_beam_groups": 1,
      "do_sample": False,
      "top_k": 50,
      "top_p": 0.9,
      "temperature": 1.0,
      "early_stopping": True,
      "pad_token_id": 0,
      "eos_token_id": 186,
      "num_return_sequences": 1,
      "return_full_text": False,
}

In [None]:
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                **template)

In [None]:
for out in tqdm(pipe(KeyDataset(dataset, "fulltext"))):
  print(out)


[{'generated_text': ' Rapid muss wohl bis Saisonende auf Offensivspieler Thomas Murg verzichten. Der im Winter aus Ried ge'}]
[{'generated_text': ' The Forbidden Room (2015) von Guy Maddin, Evan Johnson. Mit Roy Du'}]
[{'generated_text': ' Google-Chef Pichai: Alle Arschlöcher sind jetzt weg.\n'}]
[{'generated_text': ' Russland will Ölförderung einfrieren\n'}]
[{'generated_text': 'Die Geschichte der Firma\n'}]
[{'generated_text': ' Trodat Trotec Holding GmbH\n'}]
[{'generated_text': ' Heracles Almelo hat sich in den Niederlanden erstmals für einen europäischen Bewerb qualifiziert. Dank eines'}]
[{'generated_text': ' ORF-Stiftungsratssitzung: Wrabetz will Zeitplan und Budget einhalten.Die'}]
[{'generated_text': ' Pakistan: 43 killed in airstrike in North Waziristan\n'}]
[{'generated_text': ' Bundesliga: Hamburger SV – Bayern München (20.30 Uhr)\n'}]
[{'generated_text': ' Der FC Bayern München hat am Sonntag in der deutschen Bundesliga einen wichtigen Sieg gefeiert. Die'}]
[{'generated_te

KeyboardInterrupt: ignored
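
The raw outputs above often start with a space and end with a newline. A minimal post-processing sketch, assuming we only want the first line of each generation, could look like this. The helper `clean_headline` is hypothetical and not part of the original notebook; the two sample outputs are copied from the run above.

```python
def clean_headline(generated_text):
    """Keep only the first line of the model output and trim surrounding whitespace."""
    stripped = generated_text.strip()
    return stripped.splitlines()[0].strip() if stripped else ""

# Two outputs copied from the run above, in the shape the pipeline returns them.
outputs = [
    [{'generated_text': ' Russland will Ölförderung einfrieren\n'}],
    [{'generated_text': 'Die Geschichte der Firma\n'}],
]

# Collect cleaned headlines in a list instead of printing the raw dicts.
headlines = [clean_headline(out[0]["generated_text"]) for out in outputs]
print(headlines)
```

This does not repair generations that hit the `max_new_tokens` limit in the middle of a sentence; those would still need to be filtered out or truncated separately.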