The `transformers` library by Hugging Face provides a convenient interface for common NLP tasks. In the rest of this notebook, we briefly explore several of these tasks on a sample text.

In [1]:
import pandas as pd
from transformers import pipeline

In [2]:
# Excerpt from https://www.economist.com/europe/2023/12/10/europe-a-laggard-in-ai-seizes-the-lead-in-its-regulation
text: str = """
Most important, it is not clear how well the ai act will be enforced—an ongoing problem with recent digital laws passed by the eu, given that it is a club of independent countries. In the case of the gdpr, national data-protection agencies are mainly in charge, which has led to differing interpretations of the rules and less than optimal enforcement. In the case of the Digital Services Act and the Digital Markets Act, two recent laws to regulate online platforms, enforcement is concentrated in Brussels at the commission. The ai act is more of a mix, but experts worry that some national bodies will lack the expertise to prosecute violations, which can lead to fines of up to €35m ($38m) or 7% of a company’s global revenue.
"""

### Sentiment analysis

First, we will use the `text-classification` pipeline to get the sentiment of the text, which is the form of classification it performs by default. When we run it for the first time, the relevant model, tokeniser, and vocabulary are downloaded from the [Hugging Face Hub](https://oreil.ly/zLK11). Which model weights are downloaded depends on whether PyTorch, TensorFlow, or Flax is installed.
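
When no checkpoint is named, the pipeline falls back to a default one and the library warns against relying on that in production. A minimal sketch of naming the checkpoint explicitly follows; it assumes the model name and revision that this notebook's log output reports for the default, and the function name `build_classifier` is only illustrative.

```python
# Checkpoint name and revision as reported in this notebook's log output (assumed still valid).
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier():
    """Create a sentiment classifier pinned to a fixed checkpoint and revision."""
    # Imported inside the function so the constants can be inspected
    # without loading the library or downloading anything.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` returns an object that can be used in the same way as the `classifier` created in the next cell.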

In [3]:
classifier = pipeline("text-classification")
classifier_outputs = classifier(text)
pd.DataFrame(classifier_outputs)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.998763 |


The model classifies the text as negative in sentiment with a score (probability) of about `0.999`. The overall tone of the paragraph is indeed negative, so we can say that the model got it right.
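
For a two-class model such as this one, the score is typically obtained by applying a softmax to the model's raw output logits, which turns them into probabilities that sum to 1. The following is a minimal sketch of that step; the logit values are made up for illustration and are not taken from the run above.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the two classes; not taken from the model run above.
labels = ["NEGATIVE", "POSITIVE"]
logits = [3.5, -3.2]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": round(probs[best], 6)}
print(result)
```

The pipeline reports only the label with the highest probability, which is why a single row appears in the output.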

### Named entity recognition

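Next, let us use the `ner` (*Named Entity Recognition*) pipeline to extract the named entities in the text, such as organisations, locations, and other proper nouns. Passing `aggregation_strategy="simple"` merges the tokens of a multi-word entity into a single entry.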

In [4]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
ner_outputs = ner_tagger(text)
pd.DataFrame(ner_outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | MISC | 0.997629 | Digital Services Act | 373 | 393 |
| 1 | MISC | 0.997892 | Digital Markets Act | 402 | 421 |
| 2 | LOC | 0.998881 | Brussels | 500 | 508 |
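
Each row in this output covers a whole entity rather than a single token. The sketch below illustrates the general idea of grouping token-level `B-`/`I-` tags into entities. It is a simplified illustration, not the library's implementation: the `TokenPrediction` class, the example sentence, the scores, and the choice to average the scores are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TokenPrediction:
    word: str
    tag: str  # e.g. "B-MISC", "I-MISC", or "O" for "no entity"
    score: float
    start: int  # character offset where the token begins
    end: int  # character offset just past the token


def group_entities(tokens: list[TokenPrediction], text: str) -> list[dict]:
    """Merge consecutive tokens that carry the same entity label into one group."""
    groups: list[dict] = []
    current: list[TokenPrediction] = []

    def flush() -> None:
        # Emit the group collected so far, if any, and start afresh.
        if current:
            start, end = current[0].start, current[-1].end
            groups.append({
                "entity_group": current[0].tag.split("-", 1)[1],
                "score": sum(t.score for t in current) / len(current),
                "word": text[start:end],
                "start": start,
                "end": end,
            })
            current.clear()

    for tok in tokens:
        if tok.tag == "O":
            flush()
            continue
        prefix, label = tok.tag.split("-", 1)
        same_label = bool(current) and current[-1].tag.split("-", 1)[1] == label
        if prefix == "B" or not same_label:
            flush()  # a "B-" tag or a change of label starts a new entity
        current.append(tok)
    flush()
    return groups


sentence = "The Digital Services Act is enforced in Brussels."


def tok(word: str, tag: str, score: float) -> TokenPrediction:
    start = sentence.index(word)
    return TokenPrediction(word, tag, score, start, start + len(word))


# Hypothetical token-level predictions for the sentence above.
tokens = [
    tok("The", "O", 0.99),
    tok("Digital", "B-MISC", 0.99),
    tok("Services", "I-MISC", 0.98),
    tok("Act", "I-MISC", 0.99),
    tok("is", "O", 0.99),
    tok("Brussels", "B-LOC", 0.99),
]

entities = group_entities(tokens, sentence)
for entity in entities:
    print(entity)
```

The `start` and `end` fields are character offsets into the input, so `sentence[start:end]` recovers the entity text.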


### Question answering

We can use the `question-answering` pipeline to ask questions whose answers we expect to be present in the text.

In [5]:
reader = pipeline("question-answering")
question = "What is the maximum fine that can be levied under the AI act?"
reader_outputs = reader(question=question, context=text)
pd.DataFrame([reader_outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.619419 | 683 | 687 | €35m |


### Summarisation

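The `summarization` pipeline generates a shorter version of the text that keeps its key points. Here `max_length=45` is below the model's default `min_length` of 56, which is why the output includes a warning and the summary appears to be cut off at the length limit. Passing a smaller `min_length` as well would avoid the warning.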

In [9]:
summariser = pipeline("summarization")
summariser_outputs = summariser(text, max_length=45, clean_up_tokenization_spaces=True)
print(summariser_outputs[0]["summary_text"])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Your min_length=56 must be inferior than your max_length=45.


 It is not clear how well the ai act will be enforced, given that it is a club of independent countries. In the case of the gdpr, national data-protection agencies are mainly in charge


### Translation

Hugging Face also provides pipelines that translate text from one language to another. Here we name an English-to-German model explicitly.

In [10]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
translator_outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(translator_outputs[0]["translation_text"])




Am wichtigsten ist, dass nicht klar ist, wie gut das ai-Gesetz durchgesetzt werden wird – ein anhaltendes Problem mit den jüngsten digitalen Gesetzen, die von der eu verabschiedet wurden, da es sich um einen Club unabhängiger Länder handelt. Im Falle des gdpr sind die nationalen Datenschutzagenturen hauptsächlich zuständig, was zu unterschiedlichen Auslegungen der Regeln und weniger als einer optimalen Durchsetzung geführt hat. Im Fall des Digital Services Act und des Digital Markets Act, zwei neuere Gesetze zur Regulierung von Online-Plattformen, konzentriert sich die Durchsetzung in Brüssel bei der Kommission. Der ai-Gesetz ist eher ein Mix, aber Experten befürchten, dass einigen nationalen Gremien das Know-how fehlt, um Verstöße zu verfolgen, was zu Geldbußen von bis zu 35 Mio. € (38 Mio. $) oder 7% der weltweiten Einnahmen eines Unternehmens führen kann.


### Text generation

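Finally, we can use the `text-generation` pipeline to continue a prompt, here the sample text followed by the opening of a new sentence. Note that `max_length=200` counts the prompt's tokens as well, so only the remainder is available for newly generated text, which likely explains why the continuation below is short and ends mid-sentence.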

In [11]:
generator = pipeline("text-generation")
response = "There are additional challenges to regulating AI including"
prompt = text + response
generator_outputs = generator(prompt, max_length=200)
print(generator_outputs[0]["generated_text"])

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Most important, it is not clear how well the ai act will be enforced—an ongoing problem with recent digital laws passed by the eu, given that it is a club of independent countries. In the case of the gdpr, national data-protection agencies are mainly in charge, which has led to differing interpretations of the rules and less than optimal enforcement. In the case of the Digital Services Act and the Digital Markets Act, two recent laws to regulate online platforms, enforcement is concentrated in Brussels at the commission. The ai act is more of a mix, but experts worry that some national bodies will lack the expertise to prosecute violations, which can lead to fines of up to €35m ($38m) or 7% of a company’s global revenue.
There are additional challenges to regulating AI including the risk that some firms in the industry may be targeted by law enforcement. To achieve this, these issues need to be dealt with first. In order to tackle the
