<a href="https://colab.research.google.com/github/snipaid-nlg/models/blob/main/bloomz-english-centric-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bloomz Demo
This notebook experiments with the capabilities of the English-centric [bloomz-3b](https://huggingface.co/bigscience/bloomz-3b) model.
To have the model perform a task, the input text is first translated to English, processed by the model, and the result is translated back afterwards.
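
The round trip described above can be sketched as a small function. This is a minimal illustration, not the notebook's actual code: `round_trip` and its three callable parameters are hypothetical names, and the stand-in lambdas only mimic the translator and the model so that the sketch runs without a GPU or network access.

```python
from typing import Callable


def round_trip(
    text: str,
    to_english: Callable[[str], str],
    run_model: Callable[[str], str],
    to_source: Callable[[str], str],
) -> str:
    """Translate to English, let the model work on it, translate the result back."""
    english_input = to_english(text)
    english_output = run_model(english_input)
    return to_source(english_output)


# Stand-ins for illustration only (assumptions, not real translation or generation).
fake_to_english = lambda s: f"[en] {s}"
fake_model = lambda s: s.upper()
fake_to_source = lambda s: s.replace("[EN]", "[de]")

result = round_trip("Hallo Welt", fake_to_english, fake_model, fake_to_source)
print(result)
```

In the cells below, the two translation steps correspond to `GoogleTranslator(...).translate` and the model step to `model.generate`.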

In [1]:
#@title Setup
#@markdown Run this cell to get started...

#print("This is your hardware setup...")
#!nvidia-smi

print("Loading packages...")
!pip install -q transformers accelerate bitsandbytes
!pip install -U -q deep-translator
!pip install -q newspaper3k

print("\nImporting...")
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, set_seed
from deep_translator import GoogleTranslator

import torch

from newspaper import Article

print("\nSetting defaults...")
torch.set_default_tensor_type(torch.cuda.FloatTensor)
set_seed(424242)

print("\nLoading the model...")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b", use_cache=True)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")

print("\nDone! You are good to go. Have fun experimenting...")

Loading packages...



Done! You are good to go. Have fun experimenting...


In [None]:
print("\nDefining templates...")
templates = {
    "keywords": {
        "task_prefix": "Your task is to find the keywords in a text. Text:",
        "prompt": "\nKeywords: ",
        "params": {
            "min_new_tokens": 1,
            "max_new_tokens": 150,
        }
    },
    "title": {
        "task_prefix": "Your task is to write a news title for a given text. The title should be at least three words long (eight characters). It is important that you include all relevant information about what happened on this day: date/time/situations involved etc., but don\'t overwhelm with too much detail - just enough so people can read quickly without having their attention distracted by unnecessary details. Text:",
        "prompt": "\nTitle: ",
        "params": {
            "min_new_tokens": 7,
            "max_new_tokens": 20,
        }
    },
    "teaser": {
        "task_prefix": "Your task is to write a Teaser for a given text. A Teaser should whet the appetite of the reader to read the rest of the text. In one or two sentences introduce the topic of the text but don’t explain it completely. Ideally, the following train of thought takes place in the reader’s mind: “Ooh, that’s interesting! I didn’t know that. I have to read it!” This cinema of the mind is especially created when you put signal words at the beginning of the teaser, for example superlatives and enumerations. Follow the pattern stimulus, thesis, cliffhanger. \nText:",
        "prompt": "\nTeaser:",
        "params": {
            "min_new_tokens": 30,
            "max_new_tokens": 60
        }
    },
    "summary": {
        "task_prefix": "Your task is to write a short summary for a text (one or two sentences). Include all relevant information from the text: Cover the main idea that you want readers to know most about that particular piece of text; it should be clear enough so they can understand immediately how important its topic really is. \nText:",
        "prompt": "\nSummary:",
        "params": {
            "min_new_tokens": 50,
            "max_new_tokens": 150
        }
    },
    "tweet":{
        "task_prefix": "Your task is to write a Tweet about something funny, surprising or shocking from the text. Make sure it is short and sweet so people can read the whole thing in one go! \nText:",
        "prompt": "\nTweet:",
        "params": {
            "min_new_tokens": 20,
            "max_new_tokens": 80,
        }
    }
}
print("done")


Defining templates...
done


In [None]:
#@title Load an article from a URL
#@markdown Load an article from the internet.
url = 'https://t3n.de/news/hippocamera-app-gegen-demenz-1535827/' #@param {type:"string"}
article = Article(url, browser_user_agent="Googlebot-News")
article.download()
article.parse()
article.text

'Das Gedächtnis lässt sich genauso trainieren, wie ein Muskel. (Bild: LightField Studios)\n\nUnser Gehirn ist wie ein Muskel: Wird es nicht trainiert, verliert es auf Dauer an Funktionalität. Damit steigt im Alter das Risiko, an degenerativen Erkrankungen wie Demenz zu leiden.\n\nUmso wichtiger ist es daher, dass wir unserem Gehirn regelmäßig Anreize setzen, um sich zu erinnern. Genau dafür wurde die App Hippocamera entwickelt.\n\nSo arbeitet Hippocamera\n\nHippocamera setzt dafür – wie es der Name vermuten lässt – am sogenannten Hippocampus an, dem Bereich unseres Gehirns, der für das Kurzzeitgedächtnis zuständig ist. Damit solche Erinnerungen abrufbar sind, arbeitet die App mit Videos, die mit einer kurzen Tonspur unterlegt sind.\n\nUnd das funktioniert so: Zunächst werden die Nutzer:innen bei der Aufnahme in mehreren Schritten speziell angeleitet, um das Kurzzeitgedächtnis zu stimulieren.\n\nIm nächsten Schritt kreiert die App aus den kurzen Clips Hinweise und Anreize, die dann nach

In [None]:
#@title Try all snippets (with task prefix)
#@markdown <-- Hit play and keywords, a title, a teaser, a summary and a tweet suggestion will appear below...

input_article_en = GoogleTranslator(source='auto', target='en').translate(article.text[:4500])

snippet_type = "keywords"
prompt = f'{templates[snippet_type]["task_prefix"]} {input_article_en} {templates[snippet_type]["prompt"]}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
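# Remove the end-of-sequence token with replace(): strip("</s>") would strip the characters <, /, s, > individually
keywords = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip().split(', ')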
print(f"Extracted Keywords: {keywords}")

snippet_type = "title"
prompt = f'{templates[snippet_type]["task_prefix"]} {input_article_en} {templates[snippet_type]["prompt"]}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
title = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip()
title_de = GoogleTranslator(source='auto', target='de').translate(title)
print(f"Suggested title: {title_de}")

snippet_type = "teaser"
prompt = f'{templates[snippet_type]["task_prefix"]} {input_article_en} {templates[snippet_type]["prompt"]}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
teaser = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip()
teaser_de = GoogleTranslator(source='auto', target='de').translate(teaser)
print(f"Suggested teaser: {teaser_de}")

snippet_type = "summary"
prompt = f'{templates[snippet_type]["task_prefix"]} {input_article_en} {templates[snippet_type]["prompt"]}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
summary = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].split("\n")[0].replace("</s>", "").strip()
summary_de = GoogleTranslator(source='auto', target='de').translate(summary)
print(f"Summary: {summary_de}")

snippet_type = "tweet"
prompt = f'{templates[snippet_type]["task_prefix"]} {input_article_en} {templates[snippet_type]["prompt"]}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
tweet = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip()
tweet_de = GoogleTranslator(source='auto', target='de').translate(tweet)
print(f"Tweet: {tweet_de}")

Extracted Keywords: ["Alzheimer's disease", 'Memory', 'Dementia']
Suggested title: Gedächtnistrainings-App hilft Demenz vorzubeugen
Suggested teaser: Das Gehirn ist wie ein Muskel und muss regelmäßig trainiert werden, um gesund zu bleiben. Genau dafür ist die neue App Hippocamera konzipiert.
Summary: Die Deutsche Alzheimer Gesellschaft hat eine neue App entwickelt, die helfen kann, der Entstehung von Demenz vorzubeugen. Die App heißt Hippocamera und soll das Gehirn stimulieren. Die App basiert auf der Idee, dass das Gehirn wie ein Muskel ist.
Tweet: Die App, die Ihr Gehirn trainiert, sich Dinge besser zu merken (und warum das so wichtig ist) #Demenz
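
The prompt assembly and post-processing used in the cell above (join instruction, article and answer marker, then keep only the text after the marker and drop the end-of-sequence token) can be isolated in two small helpers. This is a minimal sketch: `build_prompt`, `extract_completion` and `EOS_TOKEN` are hypothetical names, and the "decoded" string is a hand-written stand-in for real model output. The token is removed with `str.replace`, because `str.strip("</s>")` treats its argument as a set of characters and would also eat a trailing `s` from the generated text.

```python
EOS_TOKEN = "</s>"  # assumption: the end-of-sequence string emitted by the tokenizer


def build_prompt(task_prefix: str, article: str, prompt_marker: str) -> str:
    """Join instruction, article text and answer marker into one prompt."""
    return f"{task_prefix} {article} {prompt_marker}"


def extract_completion(decoded: str, prompt_marker: str, eos_token: str = EOS_TOKEN) -> str:
    """Return the text after the first prompt marker, without the EOS token."""
    completion = decoded.split(prompt_marker, 1)[1]
    return completion.replace(eos_token, "").strip()


marker = "\nKeywords: "
prompt = build_prompt("Your task is to find the keywords in a text. Text:", "Some article.", marker)
decoded = prompt + "Memory, Brains</s>"  # pretend model output, for illustration only
keywords = extract_completion(decoded, marker).split(", ")
print(keywords)
```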


In [None]:
#@title Example for possible more advanced generation techniques
#@markdown <-- Hit play to generate keywords and a suggestion for a title that starts with one of the keywords...

input_article_en = GoogleTranslator(source='auto', target='en').translate(article.text[:4500])

snippet_type = "keywords"
prompt = f'{input_article_en} \nCore concept: '
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
core_concept = tokenizer.decode(sample[0]).split("Core concept: ")[1].replace("</s>", "").strip()
print(f"Main topic: {core_concept}")

snippet_type="title"
prompt = f'{input_article_en} {templates[snippet_type]["prompt"]} {core_concept}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
output = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip()
output_de = GoogleTranslator(source='auto', target='de').translate(output)
print(f"Suggested title: {output_de}")

snippet_type = "keywords"
prompt = f'{templates[snippet_type]["task_prefix"]} {input_article_en} {templates[snippet_type]["prompt"]}'
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
keywords = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip().split(', ')
print(f"Extracted Keywords: {keywords}")

for kw in keywords:
  snippet_type="title"
  prompt = f'{input_article_en} {templates[snippet_type]["prompt"]} {kw}'
  input_ids = tokenizer(prompt, return_tensors="pt").to(0)
  sample = model.generate(**input_ids, **templates[snippet_type]["params"], top_k=50, top_p=0.75)
  output = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip()
  output_de = GoogleTranslator(source='auto', target='de').translate(output)
  print(f"Suggested title: {output_de}")

Main topic: Memory training
Suggested title: Gedächtnistrainings-App, die das Gehirn trainiert, sich zu erinnern
Extracted Keywords: ["Alzheimer's disease", 'Memory', 'Dementia']
Suggested title: Alzheimer-Krankheit: Die App, die Ihr Gehirn trainiert, sich zu erinnern
Suggested title: Gedächtnistrainings-App, die das Gehirn trainiert, sich zu erinnern
Suggested title: Mit einer einfachen App lässt sich Demenz vorbeugen


### Model "Knowledge" Tests

In [2]:
prompt="Explain to me like an expert. What is a twitter-length elevator pitch sentence?"
input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, min_new_tokens=50, max_new_tokens=200, top_k=50, top_p=0.75, repetition_penalty=1.5)
tokenizer.decode(sample[0])

OutOfMemoryError: ignored

### Prompt Testing Playground

#### One article

In [None]:
input_article_en = GoogleTranslator(source='auto', target='en').translate(article.text[:4500])

prompt = f"""You are an editor in the editorial department of a renowned newspaper. Your task is to write viral headlines for news articles. In order to stay as close as possible to the facts of the story, you stick to what you know from the text. 
A colleague has just sent in a new text:  {input_article_en}
What is the angle of the text: """

input_ids = tokenizer(prompt, return_tensors="pt").to(0)
sample = model.generate(**input_ids, min_new_tokens=5, max_new_tokens=25, top_k=50, top_p=0.75)
headlines = tokenizer.decode(sample[0]).split("the text:")[1].replace("</s>", "").strip()
headlines_de = GoogleTranslator(source='auto', target='de').translate(headlines)
print(f"Headline suggestions: {headlines_de}")

Headline suggestions: Wie Sie Ihr Gehirn trainieren, um sich zu erinnern


#### t3n rss feed

In [4]:
!pip install feedparser

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
import feedparser

feed = feedparser.parse('http://www.t3n.de/rss.xml')

In [12]:
print("\nDefining templates...")
templates = {
    "keywords": {
        "task_prefix": "",
        "prompt": "\nKeywords: ",
        "params": {
            "min_new_tokens": 1,
            "max_new_tokens": 80,
            "repetition_penalty": 1.05,
            "top_k": 50, 
            "top_p": 0.75
        }
    },
    "title": {
        "task_prefix": "",
        "prompt": "\nWhat is the best title for this article? ",
        "params": {
            "min_new_tokens": 3,
            "max_new_tokens": 20,
            "length_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "repetition_penalty": 1.0,
            "diversity_penalty": 0.0,
            "num_beam_groups": 1,
            "do_sample": False,
            "temperature": 1.0,
            "early_stopping": False,
            "pad_token_id": 3,
            "eos_token_id": 2,
            "num_return_sequences": 1,
            "top_k": 50, 
            "top_p": 0.75
        }
    },
    "teaser": {
        "task_prefix": "",
        "prompt": "\nWrite a one or two sentence news hook/teaser/lede/bait: ",
        "params": {
            "min_new_tokens": 30,
            "max_new_tokens": 60,
            "top_k": 50, 
            "top_p": 0.75,
            "no_repeat_ngram_size": 2,
        }
    },
    "summary": {
        "task_prefix": "News article: ",
        "prompt": "\nSummary in two to three sentences: ",
        "params": {
            "min_new_tokens": 50,
            "max_new_tokens": 150,
            "top_k": 50, 
            "top_p": 0.75,
            "no_repeat_ngram_size": 2,
        }
    },
    "tweet":{
        "task_prefix": "",
        "prompt": "\nIn a nutshell sentence in present tense: ",
        "params": {
            "min_new_tokens": 20,
            "max_new_tokens": 80,
            "top_k": 50, 
            "top_p": 0.75
        }
    }
}
print("done")


Defining templates...
done


In [7]:
import re

In [8]:
def generate_snippet(snippet_type, text, starter=""):
  """Generate one snippet of the given type from English text and return it translated to German."""
  prompt = f'{templates[snippet_type]["task_prefix"]} {text} {templates[snippet_type]["prompt"]}{starter}'
  input_ids = tokenizer(prompt, return_tensors="pt").to(0)
  sample = model.generate(**input_ids, **templates[snippet_type]["params"])
  # Use replace() rather than strip("</s>"), which strips the characters <, /, s, > individually
  output = tokenizer.decode(sample[0]).split(templates[snippet_type]["prompt"])[1].replace("</s>", "").strip()
  output_de = GoogleTranslator(source='auto', target='de').translate(output)
  return output_de

def find_quotes(text):
  quotes = []
  quotes += re.findall(r'„(.*?)“', text)
  quotes += re.findall(r'"(.*?)"', text)
  quotes += re.findall(r"'(.*?)'", text)
  return quotes
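
A quick way to check `find_quotes` is to run it on a short sample string. The sample sentence below is invented for illustration, and the function body is copied from the cell above so that the snippet runs on its own. Note that the third pattern also matches text between apostrophes, so English contractions can produce false positives.

```python
import re


def find_quotes(text):
    quotes = []
    quotes += re.findall(r'„(.*?)“', text)  # German-style quotation marks
    quotes += re.findall(r'"(.*?)"', text)  # straight double quotes
    quotes += re.findall(r"'(.*?)'", text)  # straight single quotes
    return quotes


sample = 'Sie sagte: „Das ist neu“ und nannte es "sehr gut".'
found = find_quotes(sample)
print(found)
```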

In [13]:
items = feed['entries']
print(len(items))

for item in items[:5]:
      print(item["title"])
      # Get the article fulltext
      article = Article(item["link"], browser_user_agent="Googlebot-News")
      article.download()
      article.parse()
      # Translate
      input_article_en = GoogleTranslator(source='auto', target='en').translate(article.text[:2000])
      # Test the prompt
      print("Keywords:", generate_snippet('keywords', input_article_en))
      print("Title:", generate_snippet('title', input_article_en))
      print("Teaser:", generate_snippet('teaser', input_article_en))
      print("Summary:", generate_snippet('summary', input_article_en))
      print("Quotes:", find_quotes(article.text))
      print(item["link"])
      print("__________")

20
EU will Datenmaut von Netflix, Google und Co. - doch wer muss zahlen?
Keywords: Internet, Daten, Internetzugang
Title: EU-Kommission startet Konsultation zu Internetzugangsentgelt
Teaser: Die EU will Tech-Unternehmen Gebühren für die Nutzung ihres Netzes in Rechnung stellen. Die Kommission prüft derzeit ein sogenanntes „Connectivité-Paket“, das Technologieunternehmen dazu verpflichten würde, für die Nutzung von EU-Telekommunikationsnetzen zu zahlen.
Summary: Die Europäische Kommission erwägt eine neue Verordnung, die Technologieunternehmen dazu verpflichten würde, für die Nutzung ihres Netzwerks zu bezahlen. Die Verordnung würde auch verlangen, dass die Unternehmen, die die Dienste erbringen, Gebühren für deren Nutzung erheben. Es wäre das erste Mal, dass eine solche Regelung in Europa vorgeschlagen wird.
Quotes: ['öffentlichen Konsultation', 'Connectivity Package', 'öffentliche Konsultation', 'eine fairere Verteilung der Last', 'Daten unabhängig von deren Herkunft, Inhalt, Anwendun

In [None]:
#@title Older template versions

print("\nDefining templates...")
templates = {
    "keywords": {
        "task_prefix": "Your task is to find the keywords in a text. Text:",
        "prompt": "\nKeywords: ",
        "params": {
            "min_new_tokens": 1,
            "max_new_tokens": 150,
        }
    },
    "title": {
        "task_prefix": "Your task is to write a news title for a given text. The title should be at least three words long (eight characters). It is important that you include all relevant information about what happened on this day: date/time/situations involved etc., but don\'t overwhelm with too much detail - just enough so people can read quickly without having their attention distracted by unnecessary details. Text:",
        "prompt": "\nTitle: ",
        "params": {
            "min_new_tokens": 7,
            "max_new_tokens": 20,
        }
    },
    "teaser": {
        "task_prefix": "Your task is to write a Teaser for a given text. A Teaser should whet the appetite of the reader to read the rest of the text. In one or two sentences introduce the topic of the text but don’t explain it completely. Ideally, the following train of thought takes place in the reader’s mind: “Ooh, that’s interesting! I didn’t know that. I have to read it!” This cinema of the mind is especially created when you put signal words at the beginning of the teaser, for example superlatives and enumerations. Follow the pattern stimulus, thesis, cliffhanger. \nText:",
        "prompt": "\nTeaser:",
        "params": {
            "min_new_tokens": 30,
            "max_new_tokens": 60
        }
    },
    "summary": {
        "task_prefix": "Your task is to write a short summary for a text (one or two sentences). Include all relevant information from the text: Cover the main idea that you want readers to know most about that particular piece of text; it should be clear enough so they can understand immediately how important its topic really is. \nText:",
        "prompt": "\nSummary:",
        "params": {
            "min_new_tokens": 50,
            "max_new_tokens": 150
        }
    },
    "tweet":{
        "task_prefix": "Your task is to write a Tweet about something funny, surprising or shocking from the text. Make sure it is short and sweet so people can read the whole thing in one go! \nText:",
        "prompt": "\nTweet:",
        "params": {
            "min_new_tokens": 20,
            "max_new_tokens": 80,
        }
    }
}

## Prompt collection

### Title rewrite (GPT-3)
A trailing `-` at the end of the prompt nudges the model to answer with a list.

```
Rewrite the following blog post title into six different titles but optimized for social media virality: <FILL IN TITLE>

-
```
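
Filling in the placeholder can be done with a small helper. This is a minimal sketch: `TITLE_REWRITE_TEMPLATE` and `build_title_rewrite_prompt` are hypothetical names, and the example title is made up. The helper only builds the prompt string; sending it to a model is left out.

```python
TITLE_REWRITE_TEMPLATE = (
    "Rewrite the following blog post title into six different titles "
    "but optimized for social media virality: {title}\n\n-"
)


def build_title_rewrite_prompt(title: str) -> str:
    """Insert the title into the template and keep the trailing '-' list cue."""
    return TITLE_REWRITE_TEMPLATE.format(title=title.strip())


prompt = build_title_rewrite_prompt("  Hippocamera: an app against dementia ")
print(prompt)
```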

