# Language Modeling

## Masked Language Modeling

In [44]:
from transformers import pipeline

In [45]:
nlp = pipeline("fill-mask")

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at distilroberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


In [46]:
from pprint import pprint

In [49]:
pprint(nlp(f"Samsung {nlp.tokenizer.mask_token} is based on Android Operating systems"))

[{'score': 0.5290902853012085,
  'sequence': 'Samsung Galaxy is based on Android Operating systems',
  'token': 5325,
  'token_str': ' Galaxy'},
 {'score': 0.08044127374887466,
  'sequence': 'Samsung tablet is based on Android Operating systems',
  'token': 9995,
  'token_str': ' tablet'},
 {'score': 0.05760441720485687,
  'sequence': 'Samsung Gear is based on Android Operating systems',
  'token': 17720,
  'token_str': ' Gear'},
 {'score': 0.03700081259012222,
  'sequence': 'Samsung Chromebook is based on Android Operating systems',
  'token': 27202,
  'token_str': ' Chromebook'},
 {'score': 0.030839569866657257,
  'sequence': 'Samsung Tablet is based on Android Operating systems',
  'token': 37583,
  'token_str': ' Tablet'}]


In [54]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")

sequence = f"Samsung {tokenizer.mask_token} is based on Android Operating systems."
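input_ids = tokenizer.encode(sequence, return_tensors="tf")
# locate the position of the mask token in the (1, seq_len) input
mask_token_index = tf.where(input_ids == tokenizer.mask_token_id)[0, 1]

# logits have shape (batch, seq_len, vocab_size); keep the row at the mask position
token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]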
top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [55]:
top_5_tokens

array([ 3594, 12783, 18813,  5418,  2815], dtype=int32)

In [56]:
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Samsung software is based on Android Operating systems.
Samsung computing is based on Android Operating systems.
Samsung desktop is based on Android Operating systems.
Samsung mode is based on Android Operating systems.
Samsung technology is based on Android Operating systems.
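
The pipeline output earlier in this section reports a `score` per candidate, while the manual version works on raw logits. A minimal sketch of how the two relate, assuming the score is a softmax over the vocabulary logits at the mask position. The toy vocabulary and logit values are made up for illustration and do not come from any model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k most likely entries."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy vocabulary and logits at the masked position.
toy_vocab = ["Galaxy", "tablet", "Gear", "banana"]
toy_logits = [4.0, 2.0, 1.5, -3.0]

probs = softmax(toy_logits)
for index, prob in top_k(probs, k=3):
    print(f"{toy_vocab[index]}: {prob:.4f}")
```

Because softmax preserves ordering, taking the top-k of the logits (as `tf.math.top_k` does in the cell above) selects the same tokens as taking the top-k of the probabilities.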


## Causal Language Modeling

In [63]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelWithLMHead.from_pretrained("gpt2")

sequence = "How do u  feel about "

input_ids = tokenizer.encode(sequence, return_tensors="tf")
# get the logits for the last position, i.e. the scores for the next token
next_token_logits = model(input_ids)[0][:, -1, :]

# keep only the 50 highest-scoring tokens (top_p=1.0 disables nucleus filtering)
filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample one token id from the filtered distribution
next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)

generated = tf.concat([input_ids, next_token], axis=1)

resulting_string = tokenizer.decode(generated.numpy().tolist()[0])

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [65]:
print(resulting_string)

How do u  feel about  you
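
The filtering step in the cell above can be sketched in plain Python. This is an illustrative re-implementation of the general top-k / top-p (nucleus) idea, not the library's code: the function names `top_k_top_p_filter` and `sample_index` and the toy logits are assumptions introduced here.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return a copy of `logits` with filtered-out entries set to -inf."""
    filtered = list(logits)
    neg_inf = float("-inf")

    if top_k > 0:
        k = min(top_k, len(filtered))
        # Threshold is the k-th largest logit; ties at the threshold are kept.
        threshold = sorted(filtered, reverse=True)[k - 1]
        filtered = [x if x >= threshold else neg_inf for x in filtered]

    if top_p < 1.0:
        # Softmax over the surviving logits (exp(-inf) evaluates to 0.0).
        m = max(filtered)
        exps = [math.exp(x - m) for x in filtered]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Walk tokens from most to least likely and keep the smallest
        # prefix whose cumulative probability reaches top_p.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        cumulative = 0.0
        keep = set()
        for i in order:
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        filtered = [x if i in keep else neg_inf for i, x in enumerate(filtered)]

    return filtered


def sample_index(filtered_logits, rng):
    """Sample one index with probability proportional to exp(logit)."""
    m = max(filtered_logits)
    weights = [math.exp(x - m) for x in filtered_logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical next-token logits over a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
kept = top_k_top_p_filter(toy_logits, top_k=2)
sampled = sample_index(kept, random.Random(0))
print(kept, sampled)
```

Entries set to `-inf` receive zero probability, so sampling can only return one of the retained tokens.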


In [60]:
from transformers import pipeline

# NLP Tasks

## Sentiment Analysis

In [66]:
nlp = pipeline("sentiment-analysis")


Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_1344']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [71]:
result = nlp("I am happy today but not in mood to work")[0]

In [72]:
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: NEGATIVE, with score: 0.6907


## NER

In [73]:
from transformers import pipeline

nlp = pipeline("ner")
sequence = """ Samsung Inc. is south korean multinational conglomerate headquartered in  Seoul"""


Some layers from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing TFBertForTokenClassification: ['dropout_147']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [74]:
print(nlp(sequence))

[{'word': 'Samsung', 'score': 0.9997475147247314, 'entity': 'I-ORG', 'index': 1, 'start': 1, 'end': 8}, {'word': 'Inc', 'score': 0.9995846152305603, 'entity': 'I-ORG', 'index': 2, 'start': 9, 'end': 12}, {'word': 'k', 'score': 0.8019833564758301, 'entity': 'I-MISC', 'index': 6, 'start': 23, 'end': 24}, {'word': '##ore', 'score': 0.36128857731819153, 'entity': 'I-MISC', 'index': 7, 'start': 24, 'end': 27}, {'word': '##an', 'score': 0.7124720811843872, 'entity': 'I-MISC', 'index': 8, 'start': 27, 'end': 29}, {'word': 'Seoul', 'score': 0.9991025328636169, 'entity': 'I-LOC', 'index': 13, 'start': 75, 'end': 80}]
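
In the output above, "korean" is split into the word pieces `k`, `##ore`, and `##an`. A minimal sketch of merging such pieces back into whole words, assuming continuation pieces start with `##` and directly follow the previous piece in the text. The helper `merge_word_pieces` is hypothetical, the scores are rounded copies of the values printed above, and averaging the piece scores is just one possible choice.

```python
def merge_word_pieces(tokens):
    """Join '##'-prefixed WordPiece continuations onto the preceding token."""
    merged = []
    for tok in tokens:
        is_continuation = (
            tok["word"].startswith("##")
            and merged
            and merged[-1]["end"] == tok["start"]
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += tok["word"][2:]  # drop the '##' marker
            prev["end"] = tok["end"]
            prev["scores"].append(tok["score"])
        else:
            merged.append({
                "word": tok["word"],
                "entity": tok["entity"],
                "start": tok["start"],
                "end": tok["end"],
                "scores": [tok["score"]],
            })
    # Report the mean piece score for each merged word.
    return [
        {
            "word": m["word"],
            "entity": m["entity"],
            "start": m["start"],
            "end": m["end"],
            "score": sum(m["scores"]) / len(m["scores"]),
        }
        for m in merged
    ]


# Token-level entities copied from the printed output (scores rounded).
ner_output = [
    {"word": "Samsung", "score": 0.9997, "entity": "I-ORG", "start": 1, "end": 8},
    {"word": "Inc", "score": 0.9996, "entity": "I-ORG", "start": 9, "end": 12},
    {"word": "k", "score": 0.8020, "entity": "I-MISC", "start": 23, "end": 24},
    {"word": "##ore", "score": 0.3613, "entity": "I-MISC", "start": 24, "end": 27},
    {"word": "##an", "score": 0.7125, "entity": "I-MISC", "start": 27, "end": 29},
    {"word": "Seoul", "score": 0.9991, "entity": "I-LOC", "start": 75, "end": 80},
]

words = merge_word_pieces(ner_output)
for w in words:
    print(w["word"], w["entity"], round(w["score"], 4))
```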


## Summarization

In [78]:
ARTICLE = """ 
The Samsung Group[3] (Korean: 삼성) is a South Korean multinational conglomerate headquartered in Samsung Town, Seoul.[1] It comprises numerous affiliated businesses,[1] most of them united under the Samsung brand, and is the largest South Korean chaebol (business conglomerate).

Samsung was founded by Lee Byung-chul in 1938 as a trading company. Over the next three decades, the group diversified into areas including food processing, textiles, insurance, securities, and retail. Samsung entered the electronics industry in the late 1960s and the construction and shipbuilding industries in the mid-1970s; these areas would drive its subsequent growth. Following Lee's death in 1987, Samsung was separated into five business groups – Samsung Group, Shinsegae Group, CJ Group and Hansol Group, and Joongang Group. Since 1990, Samsung has increasingly globalised its activities and electronics; in particular, its mobile phones and semiconductors have become its most important source of income. As of 2020, Samsung has the 8th highest global brand value.[4]
"""

In [79]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer

model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512, so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)

outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [80]:
print(tokenizer.decode(outputs[0]))

<pad> the Samsung Group is the largest south Korean chaebol (business conglomerate) it was founded by Lee Byung-chul in 1938 as a trading company. the group entered the electronics industry in the late 1960s and the construction and shipbuilding industries in the mid-1970s.


## Translation

In [83]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer

model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

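# Note: t5-base was trained to translate English to German, French, and Romanian only.
# "translate English to Spanish" is not one of its task prefixes, which is why the output below is German.
inputs = tokenizer.encode("translate English to Spanish: hi how are you today", return_tensors="tf")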
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [84]:
print(tokenizer.decode(outputs[0]))

<pad> Hallo, wie bist du heute?


## Text Generation

In [85]:
from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("Thanks for your ", max_length=5, do_sample=False))


All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'Thanks for your support!'}]
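
With `do_sample=False`, generation picks the highest-scoring next token at every step (greedy decoding). A toy sketch of that loop, assuming `max_length` counts the prompt tokens as well. The function `greedy_decode`, the lookup table `TOY_NEXT`, and its scores are invented for illustration, and the tokens here are whole words rather than GPT-2's subword tokens.

```python
def greedy_decode(next_token_scores, prompt, max_length, eos_token=None):
    """Repeatedly append the highest-scoring next token until a stop condition."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)  # dict: candidate token -> score
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos_token:
            break
    return tokens


# Hypothetical next-token table keyed on the last token only.
TOY_NEXT = {
    "your": {"support": 0.6, "help": 0.3, "time": 0.1},
    "support": {"!": 0.7, ".": 0.3},
    "!": {"<eos>": 1.0},
}


def toy_scores(tokens):
    """Stand-in for a language model: look up scores from the last token."""
    return TOY_NEXT.get(tokens[-1], {"<eos>": 1.0})


result = greedy_decode(toy_scores, ["Thanks", "for", "your"], max_length=5, eos_token="<eos>")
print(result)  # ['Thanks', 'for', 'your', 'support', '!']
```

Greedy decoding is deterministic for a fixed model and prompt, unlike the sampling approach shown in the causal language modeling section.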
