In [None]:
!pip install transformers

In [None]:
!pip install datasets

## Working with pipelines

A pipeline bundles all the steps needed to turn raw data into usable predictions.

It handles the preprocessing that models require (i.e. converting text to numbers).

It also applies postprocessing to make the output human-readable.




In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline('sentiment-analysis')

In [None]:
classifier("i've been waiting for a huggingface course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

In [None]:
classifier(["i've been waiting for a huggingface course my whole life.","I don't like you."])

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9990959167480469}]

In [None]:
classifier_1 = pipeline("zero-shot-classification")

In [None]:
classifier_1(["Who is the president of USA?","I love playing football","Crowds are going crazy in today's match","When will the exam takes place?"],candidate_labels=["sports","education","politics","random","fun"])

[{'labels': ['politics', 'fun', 'random', 'education', 'sports'],
  'scores': [0.4547997713088989,
   0.42136669158935547,
   0.0626106783747673,
   0.03396724537014961,
   0.027255551889538765],
  'sequence': 'Who is the president of USA?'},
 {'labels': ['sports', 'fun', 'random', 'education', 'politics'],
  'scores': [0.7113533616065979,
   0.2855994999408722,
   0.002072680275887251,
   0.0005331700667738914,
   0.0004413310962263495],
  'sequence': 'I love playing football'},
 {'labels': ['sports', 'fun', 'politics', 'random', 'education'],
  'scores': [0.5569139719009399,
   0.4122180938720703,
   0.012931907549500465,
   0.010582774877548218,
   0.007353218737989664],
  'sequence': "Crowds are going crazy in today's match"},
 {'labels': ['education', 'fun', 'random', 'politics', 'sports'],
  'scores': [0.5308701992034912,
   0.2790766656398773,
   0.07033061236143112,
   0.06889419257640839,
   0.05082828551530838],
  'sequence': 'When will the exam takes place?'}]
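
In the output above, each result lists its `labels` in order of descending `scores`, so the first entry is the top prediction. Here is a minimal sketch of reading that off, using two results copied from the output with the scores rounded. The helper name `top_label` and the 0.5 threshold are hypothetical choices for illustration, not part of the library.

```python
# Two results copied (scores rounded) from the zero-shot output above.
results = [
    {"sequence": "Who is the president of USA?",
     "labels": ["politics", "fun", "random", "education", "sports"],
     "scores": [0.4548, 0.4214, 0.0626, 0.0340, 0.0273]},
    {"sequence": "I love playing football",
     "labels": ["sports", "fun", "random", "education", "politics"],
     "scores": [0.7114, 0.2856, 0.0021, 0.0005, 0.0004]},
]

def top_label(result, threshold=0.5):
    """Return the best label, or None when its score is below the threshold."""
    best_label, best_score = result["labels"][0], result["scores"][0]
    return best_label if best_score >= threshold else None

for r in results:
    print(r["sequence"], "->", top_label(r))
```

With this threshold the first sentence gets no label, because "politics" (0.45) and "fun" (0.42) are nearly tied. That is a reminder to look at the scores rather than only at the top label.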

In [None]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



In [None]:
generator("I want to play football so much but")

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I want to play football so much but I don't want to make mistakes. If you're not careful, you can have a team with problems and you can lose. I don't like to keep my game focused on me and I do it as"}]

In [None]:
generator = pipeline("text-generation",model="openai-gpt")


Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
generator("I want to play football so much but")

Using pad_token, but it is not set yet.


[{'generated_text': 'I want to play football so much but i \'m so tired . " she dropped heavily onto the couch and hugged her arms around her knees . " i feel horrible about what happened today . " \n " do n\'t be , " her mother said , putting'}]

In [None]:
generator("I want to play football so much but",
          max_length=30,
          num_return_sequences=2)

Using pad_token, but it is not set yet.


[{'generated_text': 'I want to play football so much but ... i feel like a dork for not seeing you at the game last night . " \n he sighed . "'},
 {'generated_text': 'I want to play football so much but i can live with my new school . " she looked down at her feet instead , but he knew what was'}]

## How a pipeline works

Tokenizer and model:

Raw data --> Numerical representation --> Logits --> Prediction

Preprocessing comes first:

Text --> Tokens --> Adding special tokens --> Numerical representation via the vocabulary
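
The three preprocessing steps can be pictured with a toy example. This is only a sketch: the vocabulary, the `toy_encode` helper, and the whitespace splitting are all hypothetical stand-ins. Real tokenizers use learned subword vocabularies with tens of thousands of entries.

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
         "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}

def toy_encode(text):
    # 1. Text -> tokens (naive split; real tokenizers use subword algorithms).
    tokens = text.lower().replace("!", " !").split()
    # 2. Add special tokens marking the start and end of the sequence.
    tokens = ["<s>"] + tokens + ["</s>"]
    # 3. Tokens -> ids via vocabulary lookup, falling back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(toy_encode("I hate this so much!"))  # [0, 4, 5, 6, 7, 8, 9, 2]
```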

We use the `AutoTokenizer` API from 🤗 Transformers; its `from_pretrained` method loads the tokenizer that matches a given checkpoint.

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint="roberta-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]


In [None]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[    0,   100,   348,    57,  2445,    13,    10, 30581,  3923, 34892,
           768,   127,  1086,   301,     4,     2],
        [    0,   100,  4157,    42,    98,   203,   328,     2,     1,     1,
             1,     1,     1,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
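
The `padding=True` behaviour visible above can be reproduced by hand. The sketch below takes the unpadded ids from that output and rebuilds both `input_ids` and `attention_mask`. The `pad_batch` helper is a hypothetical name, and the pad id of 1 is read off the printed tensor rather than taken from the tokenizer.

```python
# Token ids copied from the tokenizer output above, before padding.
seqs = [
    [0, 100, 348, 57, 2445, 13, 10, 30581, 3923, 34892, 768, 127, 1086, 301, 4, 2],
    [0, 100, 4157, 42, 98, 203, 328, 2],
]
PAD_ID = 1  # the id that fills the shorter row in the output above

def pad_batch(seqs, pad_id):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch(seqs, PAD_ID)
print(input_ids[1])
print(attention_mask[1])
```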


In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaModel: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 1024])


Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), or even by index if you know exactly where the thing you are looking for is (outputs[0]).

Model heads take the high-dimensional hidden-state vectors as input and project them onto a different dimension. They are usually composed of one or a few linear layers.
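
As a rough illustration of that projection, a single linear layer maps one hidden-state vector to one logit per label. The sketch below uses plain Python with toy sizes and random, untrained weights. `W`, `b`, and `linear_head` are hypothetical names, and a real head operates on tensors with hidden size 1024 for the model above.

```python
import random

random.seed(0)
hidden_size, num_labels = 8, 3  # toy sizes; the model above uses hidden size 1024

# Hypothetical head parameters: one weight row and one bias per label.
W = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_labels)]
b = [0.0] * num_labels

def linear_head(hidden, W, b):
    """Project one hidden-state vector onto len(W) logits: W @ hidden + b."""
    return [sum(w * h for w, h in zip(row, hidden)) + bias
            for row, bias in zip(W, b)]

hidden = [random.uniform(-1, 1) for _ in range(hidden_size)]  # stand-in hidden state
logits = linear_head(hidden, W, b)
print(len(hidden), "->", len(logits))  # 8 -> 3
```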



Different architectures are available for the complete model, i.e. the transformer body plus a head. Each architecture is designed for a specific task.

A list of the available architectures is at: https://huggingface.co/transformers/model_doc/auto.html


In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
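
A sequence-classification model returns one logit per label. The "Logits --> Prediction" step is typically a softmax followed by picking the highest probability. The sketch below shows that postprocessing in plain Python. The logits are made up, and the label mapping is an assumption; with a loaded model, the real mapping can be read from `model.config.id2label`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input, and an assumed label mapping.
logits = [-2.0, 0.5, 3.0]
id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(id2label[best], round(probs[best], 3))
```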


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
!pip install sentencepiece


In [None]:
import torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# model_name = 'mrm8488/spanish-t5-small-sqac-for-qa'
model_name = 'seduerr/mt5-paraphrases-espanol'

# Use the GPU when one is available, otherwise fall back to the CPU
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(torch_device)

def get_response(input_text, num_return_sequences, num_beams):
    """Generate num_return_sequences paraphrases of input_text with beam search."""
    # Tokenize the input and move the tensors to the model's device
    batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)

    # num_return_sequences must not exceed num_beams.
    # Note: temperature only takes effect when sampling (do_sample=True).
    translated = model.generate(**batch, max_length=60, num_beams=num_beams, num_return_sequences=num_return_sequences, temperature=1.5)

    # Decode the generated ids back into text
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

    return tgt_text



In [None]:
a = get_response("Necesito ayuda con algo de trabajo", 10, 10)