In [1]:
%%capture
!pip install transformers

## Hugging Face

The Hugging Face platform hosts a collection of ready-to-use, state-of-the-art pretrained deep learning models, and the Transformers library provides tools and interfaces for loading and using them easily.

[Hugging Face Models](https://huggingface.co/models)

### Transformers Pipeline

Some of the available pipelines:

- feature-extraction (get the vector representation of a text)
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

In [2]:
from transformers import pipeline

#### Classification

In [3]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about Machine Learning",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This is a course about Machine Learning',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.48862186074256897, 0.3579639494419098, 0.15341418981552124]}

The pipeline is called zero-shot because the model does not need to be fine-tuned on your own data: the candidate labels are supplied at inference time.
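The classifier returns a plain dictionary in which `labels` and `scores` are aligned and sorted by descending score. Below is a minimal sketch of reading off the top prediction. It reuses the output printed above as a literal, so it runs without downloading the model; the variable names are illustrative.

```python
# Output of the zero-shot classifier, copied from the cell above.
result = {
    "sequence": "This is a course about Machine Learning",
    "labels": ["business", "education", "politics"],
    "scores": [0.48862186074256897, 0.3579639494419098, 0.15341418981552124],
}

# Pair each label with its score; the first pair is the top prediction.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]
print(f"{top_label}: {top_score:.3f}")

# With the default settings the scores are a softmax over the candidate
# labels, so they sum to approximately 1.
total = sum(result["scores"])
print(f"sum of scores: {total:.3f}")
```

Because the scores are normalized across the candidates, adding or removing a candidate label changes every score, not only the one for that label.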

#### Sentiment-analysis

In [4]:
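# "sentiment-analysis" is an alias for the text-classification task;
# the model below labels Russian text as toxic or neutral.
clf = pipeline(
    task="sentiment-analysis",
    model="s-nlp/russian_toxicity_classifier",
)

# The inputs stay in Russian because the classifier targets Russian text.
text = [
    # "Only fools need order; geniuses rule over chaos."
    'Только дураки нуждается в порядке — гении господствуют над хаосом.',
    # "At least two degenerates in the thread, hmm."
    'Как минимум два дегенерата в треде, мда.',
    # "ITMO is a center of advanced science and education in Russia"
    'ИТМО — центр передовой науки и  образования в России'
]

clf(text)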

Device set to use cpu


[{'label': 'toxic', 'score': 0.8172956109046936},
 {'label': 'toxic', 'score': 0.9848678708076477},
 {'label': 'neutral', 'score': 0.9988148212432861}]
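The pipeline returns one dictionary per input, in the same order as the inputs, so predictions can be zipped back onto the texts. A minimal sketch of flagging toxic messages follows. It uses the output above as a literal, and the `THRESHOLD` value is a hypothetical cut-off chosen only for illustration.

```python
# Inputs and predictions copied from the cell above.
texts = [
    'Только дураки нуждается в порядке — гении господствуют над хаосом.',
    'Как минимум два дегенерата в треде, мда.',
    'ИТМО — центр передовой науки и  образования в России',
]
results = [
    {'label': 'toxic', 'score': 0.8172956109046936},
    {'label': 'toxic', 'score': 0.9848678708076477},
    {'label': 'neutral', 'score': 0.9988148212432861},
]

# Hypothetical confidence cut-off; tune it on your own data.
THRESHOLD = 0.9

# Keep only texts predicted as toxic with a score at or above the cut-off.
flagged = [
    t for t, r in zip(texts, results)
    if r['label'] == 'toxic' and r['score'] >= THRESHOLD
]
print(flagged)
```

With this cut-off the first text is not flagged: it is labeled `toxic`, but its score of about 0.82 falls below the threshold.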

#### Text generation

In [5]:
generator = pipeline("text-generation")
generator("In this machine learning course, we will learn how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this machine learning course, we will learn how to use machine learning technologies such as Caffe, Deep Learning and Machine Learning in order to accelerate computing for better machine learning. This course also gives students important tools for learning machine life cycle. While programming'}]

#### Text generation with a Russian-language model

In [6]:
generator = pipeline("text-generation", model="sberbank-ai/rugpt3small_based_on_gpt2")
generator(
    "В этом курсе мы научимся применять машинное обучение для",
    max_length=30,
    num_return_sequences=1,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': 'В этом курсе мы научимся применять машинное обучение для решения задач, связанных с управлением и контролем качества.  В этом курсе мы научимся использовать машин'}]

#### Question Answering

In [7]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
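This is extractive question answering: the model does not write new text, it selects a span of the context. The `start` and `end` fields are character offsets into the context string, which the short sketch below checks using the output above as a literal.

```python
# Context and prediction copied from the cell above.
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
result = {
    "score": 0.6949766278266907,
    "start": 33,
    "end": 45,
    "answer": "Hugging Face",
}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```

The offsets are handy for highlighting the answer inside the original document instead of showing it in isolation.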