## Okay, let's use some pretrained models for different exercises

In [16]:
# First, import the packages we need

from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification

## Sentiment Analysis

In [17]:
# Try the phrase "I love you" in different languages
pipe_1 = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
# The French input scores noticeably lower than the English and German ones.
# Note that 'Je tu aime' is not grammatical French (the usual form is "Je t'aime"), which may play a role.
pipe_1(['Je tu aime','I love you','Ich liebe dich'])

[{'label': '5 stars', 'score': 0.4837053120136261},
 {'label': '5 stars', 'score': 0.8546808362007141},
 {'label': '5 stars', 'score': 0.7413736581802368}]
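
The pipeline returns one dict per input, so the star rating can be pulled out as a number for further processing. A minimal sketch, assuming the labels always look like "N stars" (or "1 star"); the helper name `stars_from_label` is made up for illustration, and `results` is simply the output above copied by hand:

```python
# Output of pipe_1, copied from the cell above.
results = [
    {'label': '5 stars', 'score': 0.4837053120136261},
    {'label': '5 stars', 'score': 0.8546808362007141},
    {'label': '5 stars', 'score': 0.7413736581802368},
]

def stars_from_label(label):
    """Hypothetical helper: turn a label such as '5 stars' into the integer 5."""
    return int(label.split()[0])

# Pair each numeric rating with its rounded confidence score.
ratings = [(stars_from_label(r['label']), round(r['score'], 3)) for r in results]
print(ratings)
```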

## Filling masks

In [28]:
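# Default fill-mask model (no model name given)
pipe_2 = pipeline('fill-mask')

# French model
pipe_3 = pipeline('fill-mask', model='camembert-base')

# Keep only the top prediction from each pipeline
pipe_2('I love <mask>')[0], pipe_3('Je <mask> aime')[0]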


No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


({'score': 0.03729768097400665,
  'sequence': 'I love him',
  'token': 123,
  'token_str': ' him'},
 {'score': 0.5958378314971924,
  'sequence': 'Je vous aime',
  'token': 39,
  'token_str': 'vous'})

## Zero-shot classification

In [4]:

pipe_4 = pipeline("zero-shot-classification")

sequence = 'pour moi, les mathématiques ont beaucoup de pouvoir dans la société'

candidate_labels = ['sport','espace','science']

pipe_5 = pipeline("zero-shot-classification", model = "cmarkea/distilcamembert-base-nli")
# Okay, let's compare the results from two different models:
# the default model (facebook/bart-large-mnli) and one I found myself (cmarkea/distilcamembert-base-nli)
pipe_4(sequence,candidate_labels),pipe_5(sequence,candidate_labels,hypothesis_template="Ce texte parle de {}.")

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


({'labels': ['science', 'espace', 'sport'],
  'scores': [0.9355031847953796, 0.05409245193004608, 0.010404355823993683],
  'sequence': 'pour moi, les mathématiques ont beaucoup de pouvoir dans la société'},
 {'labels': ['espace', 'science', 'sport'],
  'scores': [0.42121946811676025, 0.4060155749320984, 0.17276489734649658],
  'sequence': 'pour moi, les mathématiques ont beaucoup de pouvoir dans la société'})
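
As far as I understand the default single-label mode, a zero-shot pipeline turns each candidate label into an NLI hypothesis via `hypothesis_template`, scores how strongly the sequence entails each hypothesis, and normalizes those scores across the labels. The sketch below covers only the hypothesis-building and ranking steps in plain Python. The logits are invented for illustration (they are not model output), and `rank_labels` is a hypothetical helper, not part of transformers:

```python
import math

sequence = 'pour moi, les mathématiques ont beaucoup de pouvoir dans la société'
candidate_labels = ['sport', 'espace', 'science']
hypothesis_template = "Ce texte parle de {}."

# One (premise, hypothesis) pair per candidate label.
pairs = [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

def rank_labels(labels, entailment_logits):
    """Hypothetical helper: softmax the per-label logits and sort descending."""
    m = max(entailment_logits)
    exps = [math.exp(x - m) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

# Made-up entailment logits, only to exercise the helper.
fake_logits = [-1.0, 0.5, 2.0]
ranking = rank_labels(candidate_labels, fake_logits)

print(pairs[2][1])
print(ranking)
```

This also hints at why the template matters: an English-trained NLI model and a French one see different hypothesis sentences, so a French template is a natural choice for the French model.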

## Okay, now let's work with AutoTokenizer and TFAutoModelForSequenceClassification

As an example, we take the model from pipe_1 and look at what happens under the hood when we use a pipeline.

In [5]:

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [8]:
texts = ['Je tu aime','I love you','Ich liebe dich']
inputs = tokenizer(texts)
inputs

{'input_ids': [[101, 10149, 10689, 37942, 102], [101, 151, 11157, 10855, 102], [101, 12373, 24941, 19229, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

In [9]:
inputs_with_padding = tokenizer(texts, padding = True, truncation = True, max_length = 256, return_tensors="tf")
inputs_with_padding

{'input_ids': <tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[  101, 10149, 10689, 37942,   102],
       [  101,   151, 11157, 10855,   102],
       [  101, 12373, 24941, 19229,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]], dtype=int32)>}
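
All three inputs above happen to be five tokens long, so padding changed nothing here. A minimal pure-Python sketch of what padding does when the lengths differ: shorter sequences are filled up with a pad id, and the attention mask marks real tokens with 1 and padding with 0. The pad id of 0 and the helper `pad_batch` are assumptions for illustration (the real tokenizer uses its own `pad_token_id`), and the second sequence is shortened by hand:

```python
def pad_batch(batch, pad_id=0):
    """Hypothetical helper: right-pad token id lists to the longest length."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# First row copied from the output above; second row shortened by hand.
batch = [[101, 10149, 10689, 37942, 102], [101, 151, 102]]
padded = pad_batch(batch)

print(padded['input_ids'])
print(padded['attention_mask'])
```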

In [10]:
outputs = model(inputs_with_padding)
outputs

TFSequenceClassifierOutput([('logits',
                             <tf.Tensor: shape=(3, 5), dtype=float32, numpy=
                             array([[-2.2876294 , -1.9347584 ,  0.45290372,  1.5439389 ,  1.8068829 ],
                                    [-2.0086846 , -2.101042  , -0.6899481 ,  1.0338992 ,  3.0443912 ],
                                    [-1.7996992 , -2.0554729 , -0.512829  ,  1.2688576 ,  2.545792  ]],
                                   dtype=float32)>)])

In [20]:
import tensorflow as tf
predictions = tf.nn.softmax(outputs[0], axis=-1)

float(predictions[0][4]), float(predictions[1][4]), float(predictions[2][4])

(0.4837052524089813, 0.8546808362007141, 0.741373598575592)
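
To see what the softmax step does without TensorFlow, here is a minimal pure-Python sketch applied to the logits printed above. It is only meant to show the arithmetic, not to replace `tf.nn.softmax`; the function name `softmax` and the hand-copied `logits` list are for illustration:

```python
import math

# Logits copied from the model output above (one row per input text).
logits = [
    [-2.2876294, -1.9347584, 0.45290372, 1.5439389, 1.8068829],
    [-2.0086846, -2.101042, -0.6899481, 1.0338992, 3.0443912],
    [-1.7996992, -2.0554729, -0.512829, 1.2688576, 2.545792],
]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]

# Probability of the last class (index 4) for each text.
five_star = [p[4] for p in probs]
print(five_star)
```

Index 4 is the largest entry in every row, which is why the pipeline reports the label for that class together with this probability as its score.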

Recall the results from pipe_1 and compare them: the scores match the softmax probabilities above up to floating-point precision.

In [19]:
pipe_1(['Je tu aime','I love you','Ich liebe dich'])

[{'label': '5 stars', 'score': 0.4837053120136261},
 {'label': '5 stars', 'score': 0.8546808362007141},
 {'label': '5 stars', 'score': 0.7413736581802368}]